Explanations
derogatory and offensive language
Negative Logits
enhagen          -0.72
escal            -0.68
corrid           -0.66
earchers         -0.66
ーン             -0.65
paren            -0.64
restricted       -0.64
Passage          -0.64
bilateral        -0.63
uninterrupted    -0.63
Positive Logits
fuck             1.06
bastard          1.01
bitch            0.96
asshole          0.96
hole             0.95
gery             0.95
hypocr           0.94
cunt             0.94
crap             0.91
liar             0.89
Activations Density: 0.167%