    Explanations

    references to hate or hate-related concepts

    Negative Logits
    "erable"            -0.17
    "erie"              -0.15
    "郎"                -0.15
    ".scalablytyped"    -0.15
    "ettle"             -0.14
    "thora"             -0.14
    " заст"             -0.14
    "lover"             -0.14
    " Martial"          -0.14
    "ieur"              -0.14
    Positive Logits
    " speech"           0.39
    "fully"             0.32
    " Speech"           0.32
    " crime"            0.32
    "Speech"            0.28
    " crimes"           0.27
    "speech"            0.27
    "peech"             0.26
    "fulness"           0.26
    "crime"             0.25
    Activation Density: 0.010%

    No Known Activations