Explanations
derogatory terms and slurs related to women and minorities
vulgar insults
Negative Logits
Życiorys     -0.56
Silas        -0.52
Surya        -0.49
Lumi         -0.49
dolu         -0.48
<unused51>   -0.47
<pad>        -0.47
<unused28>   -0.47
<unused14>   -0.47
<unused52>   -0.47
Positive Logits
bitch          1.40
Bitch          1.35
Bitch          1.34
bitch          1.20
bitches        1.02
slut           0.52
createState    0.50
avoient        0.50
motherfucker   0.49
PMailer        0.49
Activations Density 0.010%