INDEX
    Explanations

    expressions of strong negative feelings, particularly hate and dislike

    Negative Logits
    erland
    -0.16
    osi
    -0.16
     Wor
    -0.15
    .tk
    -0.15
    airo
    -0.14
    illo
    -0.13
    ected
    -0.13
    çek
    -0.13
    ILLE
    -0.13
    onth
    -0.13
    Positive Logits
     admitting
    0.15
    losing
    0.15
    πισ
    0.15
    loss
    0.15
    lose
    0.14
     surprises
    0.14
    ulumi
    0.14
     disrupt
    0.14
     вообще
    0.14
    증
    0.14
    Act Density 0.125%

    No Known Activations