INDEX
    Explanations

    ill-formed words

    words related to slurs or derogatory terms

    NEGATIVE LOGITS
    ICLE      -0.78
    ãģ®éŃĶ    -0.76
    IFIC      -0.74
    enegger   -0.73
     Cause    -0.68
    éļ        -0.65
    BLIC      -0.65
     Realms   -0.64
    Trust     -0.63
    terms     -0.63
    POSITIVE LOGITS
    anted     1.19
    udge      1.16
    ugg       1.16
    otted     1.14
    ights     1.13
    asher     1.12
    ashes     1.12
    ither     1.12
    inging    1.11
    ipp       1.11
    Activation Density: 0.010%

    No Known Activations