INDEX
    Explanations
    Negative Logits
    Token         Logit
    " scaling"    -0.08
    "Bank"        -0.08
    " BANK"       -0.08
    " verrou"     -0.07
    " pesado"     -0.07
    " isn't"      -0.07
    ".cur"        -0.07
    "_LOCK"       -0.07
    " nla"        -0.07
    " Hagen"      -0.07
    Positive Logits
    Token         Logit
    "leo"         0.09
    " tones"      0.08
    " juicio"     0.08
    " sexuelle"   0.08
    "ţi"          0.08
    " judging"    0.07
    "ffred"       0.07
    " tone"       0.07
    "Tone"        0.07
    "yny"         0.07
    Activation Density: 0.001%

    No Known Activations