INDEX
    Explanations

    racial slurs

    Negative Logits
    الي            -0.07
    _PROPERTY      -0.06
    	hs            -0.06
    ивання         -0.06
     membres       -0.06
     tyres         -0.06
     totaling      -0.06
    ThanOr         -0.06
     champagne     -0.06
     lớp           -0.06
    Positive Logits
     slur          0.08
    /hash          0.07
     NSURL         0.07
     ere           0.06
     INIT          0.06
     Hund          0.06
     colorful      0.06
     accur         0.06
    ,password      0.06
    _REMOVE        0.06
    Activation Density: 0.010%

    No Known Activations