INDEX
Explanations
spam, abusive, or offensive content
Negative Logits
meis        -1.01
laziness    -1.00
goku        -0.95
jep         -0.91
nerds       -0.91
suzuki      -0.90
itali       -0.90
labrador    -0.90
vasco       -0.90
versace     -0.89
Positive Logits
spam        2.33
hate        1.97
racist      1.81
abusive     1.80
malicious   1.79
offensive   1.75
harmful     1.72
spam        1.70
porn        1.68
bad         1.66
Activations Density: 0.111%