INDEX
Explanations
words associated with masculine and feminine characteristics
Negative Logits
aneers    -0.77
stall     -0.76
nl        -0.69
tein      -0.69
pan       -0.65
umblr     -0.64
OUT       -0.64
ettel     -0.63
zan       -0.62
ciples    -0.62
Positive Logits
atively         0.91
ively           0.88
ativity         0.83
affili          0.73
ative           0.72
associations    0.71
enza            0.69
iated           0.69
hips            0.68
ãĥł             0.68
Activations Density 0.073%