Explanations
words relating to people/biology and sex
Negative Logits
male      -2.42
Male      -2.20
Male      -2.17
male      -2.11
female    -1.96
MALE      -1.91
Female    -1.86
Female    -1.82
female    -1.79
männ      -1.67
Positive Logits
EconPapers         0.75
fromnode           0.68
tagHelperRunner    0.66
quias              0.65
GOG                0.62
makeText           0.61
ientos             0.59
rrggbb             0.58
Trag               0.56
umma               0.56
Activations Density: 9.934%