INDEX
Explanations
proper nouns
mentions of specific individuals, particularly female figures
Negative Logits
iferation   -0.80
WC          -0.72
HCR         -0.69
UST         -0.67
udic        -0.66
Nanto       -0.64
VL          -0.64
HUD         -0.63
essee       -0.63
BD          -0.63
Positive Logits
Betty       0.92
yip         0.88
keye        0.84
rics        0.82
rand        0.77
hesda       0.76
oro         0.75
rants       0.75
Seym        0.74
plin        0.72
Activations Density 0.013%