    Negative Logits
    " therap"        -0.07
    " CLI"           -0.07
    " congratulate"  -0.07
                     -0.07
    "eneral"         -0.07
    "ITLE"           -0.06
    " 이야"          -0.06
    " služby"        -0.06
    " studi"         -0.06
                     -0.06
    Positive Logits
    " rot"     0.16
    " Rot"     0.15
    "Rot"      0.14
    " rotor"   0.10
    "rot"      0.09
    " rotten"  0.09
    ".rot"     0.08
    " ROT"     0.08
    "ot"       0.08
    "_rot"     0.07
    Activation Density: 0.004%

    No Known Activations