INDEX
    Explanations
    Negative Logits
    " teor"          -0.08
    "imited"         -0.07
    "anguage"        -0.07
    " hurt"          -0.07
    " Orig"          -0.07
    "urt"            -0.07
    " north"         -0.07
    "eter"           -0.07
    "raction"        -0.07
    "Orientation"    -0.07
    Positive Logits
    " grasp"         0.07
    " slander"       0.07
    "лиж"            0.07
    " Playback"      0.06
    " bladder"       0.06
    "้วย"            0.06
    "_slave"         0.06
    " Slater"        0.06
    " plague"        0.06
    "la"             0.06
    Act Density 0.184%

    No Known Activations