INDEX
Explanations
instances of insults and derogatory language
Negative Logits
orr       -0.07
ilden     -0.07
yles      -0.07
ales      -0.07
gie       -0.07
erness    -0.07
stral     -0.07
ills      -0.07
elp       -0.07
over      -0.07
Positive Logits
ively     0.09
ingly     0.09
ably      0.08
uous      0.08
atory     0.08
antly     0.07
ive       0.07
271       0.07
urb       0.06
acios     0.06
Activations Density: 0.004%