Explanations
insults and controversial statements
Negative Logits
Token        Logit
ail          -0.78
negie        -0.75
ills         -0.75
illon        -0.74
frames       -0.72
olutions     -0.70
angler       -0.67
overs        -0.64
hare         -0.64
wings        -0.64
Positive Logits
Token        Logit
insult       1.07
insulted     1.06
disrespect   1.03
slurs        0.99
humour       0.93
insulting    0.93
insults      0.90
stereotypes  0.88
dispar       0.86
humor        0.86
Activations Density: 0.071%