    Explanations
    Negative Logits
        Token           Logit
        " disturbs"     0.45
        " disturbed"    0.39
        " αρκε"         0.38
        " někol"        0.38
        " plufieurs"    0.38
        "ζί"            0.37
        " préf"         0.36
        "𝒑"             0.36
        " ид"           0.35
        " запу"         0.35
    Positive Logits
        Token           Logit
        " always"       0.88
        "always"        0.88
        " ALWAYS"       0.86
        "Always"        0.84
        " siempre"      0.82
        " всегда"       0.81
        " Always"       0.80
        "ALWAYS"        0.80
        " every"        0.79
        " selalu"       0.79
    Activation Density: 0.177%

    No Known Activations