    Explanations

    spam, abusive, or offensive content

    Negative Logits
    " meis"       -1.01
    " laziness"   -1.00
    " goku"       -0.95
    " jep"        -0.91
    " nerds"      -0.91
    " suzuki"     -0.90
    " itali"      -0.90
    " labrador"   -0.90
    " vasco"      -0.90
    " versace"    -0.89
    Positive Logits
    " spam"        2.33
    " hate"        1.97
    " racist"      1.81
    " abusive"     1.80
    " malicious"   1.79
    " offensive"   1.75
    " harmful"     1.72
    "spam"         1.70
    " porn"        1.68
    " bad"         1.66
    Activation Density: 0.111%

    No Known Activations