INDEX
    Explanations

    references to racial and ethnic language or slurs

    Negative Logits
    Token         Logit
    "ienne"       -0.16
    "ror"         -0.15
    "ÑĸнÑĮ"       -0.15
    "yre"         -0.15
    "zej"         -0.14
    "semicolon"   -0.14
    "yro"         -0.13
    "102"         -0.13
    " altru"      -0.13
    "ë¬"          -0.13
    Positive Logits
    Token         Logit
    " epith"      0.25
    " language"   0.23
    " obsc"       0.22
    " curse"      0.21
    " words"      0.20
    " vul"        0.20
    " swear"      0.20
    " coarse"     0.20
    " swearing"   0.19
    " sworn"      0.19
    Activation Density: 0.074%

    No Known Activations