INDEX
Explanations
references to historical or fictional female heroes
Negative Logits
ners    -0.18
peng    -0.16
olas    -0.16
gang    -0.15
ster    -0.15
ty      -0.15
mate    -0.14
ety     -0.14
ings    -0.14
ม       -0.14
Positive Logits
ines    0.24
ically  0.22
ics     0.20
anova   0.18
ine     0.18
ism     0.17
itics   0.16
ic      0.16
897     0.16
MES     0.16
Activations Density: 0.015%