    Negative Logits
    Token          Logit
    " Bora"        -0.08
    " residual"    -0.08
    "Residual"     -0.08
    " Byzant"      -0.08
    " nasled"      -0.07
    " bespoke"     -0.07
    " neglig"      -0.07
    " boj"         -0.07
    " matag"       -0.07
    " paub"        -0.07
    Positive Logits
    Token              Logit
    " evidence"        0.10
    " backing"         0.10
    " തെള"             0.10
    " evid"            0.09
    (token missing)    0.09
    (token missing)    0.09
    "Evidence"         0.09
    " Evidence"        0.09
    "assistant"        0.09
    " preuves"         0.09
    Activation Density: 0.019%

    No Known Activations