INDEX
Explanations
references to significant male figures
Negative Logits
Token        Logit
puter        -0.18
ingly        -0.15
ted          -0.15
æķ£          -0.15
itionally    -0.14
agy          -0.14
syn          -0.14
itte         -0.14
lectric      -0.14
gether       -0.14
Positive Logits
Token        Logit
ufac         0.20
hattan       0.19
/her         0.17
opause       0.16
agements     0.16
hunt         0.16
iac          0.16
agment       0.15
äh           0.15
ne           0.14
Activations Density 0.143%
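
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the fraction of token positions on which a feature's activation is above zero. This is an assumption about the metric, not a statement of how this dashboard derives it. The function name `activation_density`, the `threshold` parameter, and the toy activation list are all hypothetical; the toy data is made up and merely chosen so the result lands on the same order of magnitude.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data (not taken from the dashboard):
# 7000 token positions, of which 10 have a nonzero activation.
toy_activations = [0.0] * 6990 + [1.5] * 10

density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints "0.143%"
```

A density this low would mean the feature fires on roughly one token in seven hundred, which is typical of a sparse, fairly specific feature.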