INDEX
    Explanations

    expressions related to hatred or strong negative feelings

    feeling hate or related emotions

    Negative Logits
    Token                Logit
    "Rohy"               -0.60
    " erſt"              -0.59
    "SBATCH"             -0.59
    " stiefe"            -0.59
    " }{@"               -0.58
    "encodeWith"         -0.57
    "Климат"             -0.56
    "moveToFirst"        -0.55
    "ьаж"                -0.55
    "verifyException"    -0.53
    Positive Logits
    Token                Logit
    " hate"              1.11
    " hatred"            1.09
    " hates"             1.03
    " hated"             1.03
    " HATE"              1.00
    " hating"            0.99
    " Hate"              0.97
    "hate"               0.96
    "Hate"               0.96
    " ненави"            0.85
    Activation Density: 0.035%

    No Known Activations