INDEX
Explanations
expletives or strong vulgar language
NEGATIVE LOGITS
Ĭ±          -0.73
ancest      -0.71
mosqu       -0.70
pestic      -0.67
utterstock  -0.67
æĪ¦         -0.66
rece        -0.66
reluct      -0.64
Decre       -0.64
reper       -0.63
POSITIVE LOGITS
hole        1.18
holes       1.15
tty         1.02
king        0.98
cking       0.94
kers        0.92
shit        0.88
gger        0.88
fuck        0.85
ing         0.85
Activations Density 0.009%