Explanations
phrases related to insults
references to insults and derogatory language
Negative Logits
Token       Logit
negie       -0.73
ills        -0.73
arten       -0.70
enfranch    -0.67
20439       -0.65
frames      -0.64
iggle       -0.64
ulhu        -0.63
olin        -0.63
illon       -0.62
Positive Logits
Token       Logit
insult      1.02
insults     0.96
insulted    0.95
disrespect  0.93
insulting   0.90
humour      0.83
caric       0.82
ingly       0.81
humili      0.80
hygiene     0.76
Activations Density 0.092%
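
The density figure is commonly read as the share of sampled tokens on which the feature fires. The sketch below illustrates that reading only; it assumes "fires" means an activation strictly greater than zero, which this page does not state. The function name `activation_density` and the sample values are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as active when its activation is strictly
    greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 of 8 tokens activate the feature.
sample = [0.0, 0.0, 1.7, 0.0, 0.4, 0.0, 0.0, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 37.500%
```

Under this reading, a density of 0.092% would correspond to roughly 92 active tokens per 100,000 sampled.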