INDEX
    Explanations
    Negative Logits
    ires    0.84
    the     0.74
    t       0.73
    ées     0.70
    ING     0.67
    m       0.65
    inary   0.63
    illing  0.62
    ron     0.61
    acion   0.61
    POSITIVE LOGITS
     kindness  1.17
     Kindness  1.05
    ва         1.02
    ות         1.02
    ко         0.99
    ма         0.98
    ان         0.93
    ться       0.91
    ם          0.91
    ات         0.89
    Act Density 0.003%

    No Known Activations