INDEX
    Explanations

    references to hate and hateful behavior

    Negative Logits
    jam  -0.17
    ooks  -0.17
    nten  -0.15
    çĽ  -0.15
    umo  -0.15
    ãĥ«ãĥķ  -0.14
    á»  -0.14
    theless  -0.14
    onsense  -0.14
    azzi  -0.14
    Positive Logits
     pol  0.17
    aad  0.15
    ouser  0.15
    IH  0.14
    inger  0.14
    beck  0.14
    onian  0.14
    emp  0.14
    \-  0.13
    å¸ĿåĽ½  0.13
    Act Density 0.006%

    No Known Activations