INDEX
Explanations
personal pronouns and gendered nouns
references to male and female characters
Negative Logits
grave         -0.76
assing        -0.75
stellar       -0.66
ylon          -0.65
igmatic       -0.64
kefeller      -0.63
Observatory   -0.62
Ĥª            -0.61
irm           -0.61
cgi           -0.60
Positive Logits
mos           0.95
Majesty       0.94
'll           0.93
didn          0.86
'd            0.85
knew          0.84
wanted        0.82
knows         0.81
didnt         0.80
hates         0.80
Activations Density: 0.237%