Explanations
references to male individuals and their roles in various contexts
Negative Logits
dale     -0.18
lu       -0.18
ally     -0.18
sale     -0.18
naire    -0.17
lauf     -0.16
ale      -0.16
rial     -0.15
ese      -0.15
rie      -0.15
Positive Logits
volent        0.26
-dominated    0.18
itarian       0.18
factor        0.17
ştir          0.17
ynes          0.15
cul           0.15
洲            0.15
utdown        0.15
quota         0.14
Activations Density: 0.024%