INDEX
    Explanations
    Negative Logits
    ning          -0.57
    nals          -0.51
    lichkeit      -0.48
    centration    -0.48
    gemeinde      -0.47
    olla          -0.46
    alism         -0.46
    ING           -0.45
    idle          -0.45
    alele         -0.45
    Positive Logits
    e             0.69
    s             0.69
    ton           0.69
    ttes          0.69
    ly            0.65
    man           0.64
    mente         0.62
    ds            0.59
    tt            0.57
    the           0.54
    Activation Density 0.123%

    No Known Activations