INDEX
    Explanations

    explicit references to aggressive or hostile language

    Negative Logits
    iferay
    -0.16
    太郎
    -0.16
     Guys
    -0.16
    arth
    -0.16
     Zap
    -0.14
    longleftrightarrow
    -0.14
    ouro
    -0.14
    .glob
    -0.14
    sez
    -0.14
     boobs
    -0.14
    Positive Logits
     nig
    0.25
     ass
    0.20
     Offset
    0.19
     hoe
    0.19
     hood
    0.18
     mf
    0.18
     Flex
    0.17
     Hood
    0.17
     hom
    0.17
    ayo
    0.17
    Act Density 0.092%

    No Known Activations