    Explanations

    demonstrating how hate speech functions

    Negative Logits
    ên              0.43
    ıl              0.39
    ','             0.38
    ini             0.38
    getC            0.37
    ってる          0.37
    ȋ               0.36
    omen            0.35
                    0.35
    erman           0.35
    Positive Logits
     isom           0.44
     transgress     0.41
     stratosphere   0.40
    ネルギー        0.39
    اقات            0.39
     entour         0.38
     postulate      0.38
     ischemia       0.38
     elektrische    0.38
     magnitudes     0.38
    Act Density 0.000%

    No Known Activations