Explanations
explicit references to aggressive or hostile language
Negative Logits
Token                 Logit
iferay                -0.16
太郎                  -0.16
Guys                  -0.16
arth                  -0.16
Zap                   -0.14
longleftrightarrow    -0.14
ouro                  -0.14
.glob                 -0.14
sez                   -0.14
boobs                 -0.14
Positive Logits
Token     Logit
nig       0.25
ass       0.20
Offset    0.19
hoe       0.19
hood      0.18
mf        0.18
Flex      0.17
Hood      0.17
hom       0.17
ayo       0.17
Activations Density 0.092%
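
The density figure above can be read as the share of tokens on which the feature fires. The sketch below shows one way such a number might be computed. It assumes density means "percentage of tokens with activation above a threshold"; the function name, the threshold default, and the sample data are all hypothetical and do not come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the fraction of active tokens, expressed in percent.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical activations for one feature over 10,000 tokens:
# the feature fires on 10 of them and is silent on the rest.
sample = [0.0] * 9990 + [1.5] * 10
density = activation_density(sample)
print(f"{density:.3f}%")
```

A density well below 1%, as reported here, would under this reading indicate a sparse feature that fires on only a small number of tokens.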