Explanations
references to female protagonists or figures of significance
Negative Logits
ners    -0.18
pun     -0.16
igkeit  -0.16
away    -0.16
olas    -0.15
ty      -0.15
mate    -0.15
रहन     -0.15
ster    -0.15
tures   -0.15
Positive Logits
ines     0.38
ics      0.30
ine      0.27
ically   0.26
INES     0.25
ism      0.23
ína      0.23
INE      0.22
Worship  0.21
icism    0.21
Activations Density 0.022%