Explanations
words related to insults or derogatory remarks
references to insults or derogatory language
Negative Logits
Token       Logit
arijuana    -0.82
20439       -0.78
ilver       -0.75
negie       -0.74
etheus      -0.74
aver        -0.72
ccording    -0.70
angler      -0.68
Folder      -0.67
agically    -0.66
Positive Logits
Token       Logit
insult      1.37
insults     1.17
insulted    1.16
insulting   1.05
ingly       0.97
disrespect  0.97
humili      0.92
prejudice   0.89
offend      0.86
ãĤ¹ãĥĪ      0.86
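The last positive-logit token, ãĤ¹ãĥĪ, looks like a byte-level BPE vocabulary string rather than corrupted text. The sketch below assumes the model uses a GPT-2-style byte-to-unicode table, which this page does not state; the helper names `bytes_to_unicode` and `decode_token` are illustrative only. Under that assumption, the token can be mapped back to raw bytes and decoded as UTF-8.

```python
def bytes_to_unicode():
    """Rebuild a GPT-2-style byte -> printable-character table (assumed vocabulary)."""
    # Bytes that are already printable map to themselves.
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at or above 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Invert the table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a vocabulary string back into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A token may end mid-character, so replace undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace")


print(decode_token("\u00e3\u0124\u00b9\u00e3\u0125\u012a"))  # the token listed above; prints スト
print(decode_token("insult"))  # plain ASCII tokens come back unchanged
```

If the assumption holds, the token is the katakana fragment スト, and a leading "Ġ" in other vocabulary strings would decode to a space.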
Activations Density 0.015%
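A density of 0.015% corresponds to roughly 15 active tokens per 100,000. The sketch below assumes density means the fraction of sampled tokens on which the feature's activation exceeds zero; the page does not give the exact definition or the sample size, and both the `activation_density` function and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: 100,000 token positions, 15 of which fire.
acts = [0.0] * 100_000
for i in range(15):
    acts[i * 1000] = 1.0

density = activation_density(acts)
print(f"{density:.3%}")  # 0.015%
```

A feature this sparse fires on very few tokens, which is consistent with the narrow set of insult-related tokens in the positive logits.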