INDEX
Explanations
derogatory and offensive language
derogatory terms and insults
Negative Logits
qua               -0.71
tnc               -0.71
unification       -0.67
Horizons          -0.67
conduc            -0.67
transformative    -0.65
bilateral         -0.65
tranqu            -0.65
Passage           -0.64
stabilization     -0.63
Positive Logits
bastard           1.06
fuck              1.05
liar              1.01
bitch             1.00
Bastard           1.00
cunt              0.99
asses             0.98
asshole           0.98
idiot             0.97
hole              0.96
Activations Density: 0.137%