Explanations
references to boys or male characters across various contexts
Negative Logits
lesh       -0.16
mente      -0.15
conti      -0.14
elay       -0.14
abbo       -0.14
lej        -0.14
нина       -0.14
ikan       -0.14
lements    -0.14
ect        -0.13
Positive Logits
friend      0.17
انه         0.17
enda        0.16
insky       0.16
friends     0.15
hood        0.15
_configure  0.15
وه          0.15
avier       0.14
essler      0.14
Activations Density 0.026%