INDEX
Explanations
words related to antagonists, especially villains
references to villains in narratives or stories
Negative Logits
Token         Logit
galitarian    -0.86
ollen         -0.75
sterdam       -0.74
independent   -0.74
oday          -0.72
glas          -0.72
verett        -0.71
choes         -0.70
apers         -0.69
issance       -0.69
Positive Logits
Token         Logit
ous           1.17
ously         1.06
villain       1.00
villains      0.93
mastermind    0.92
Bane          0.88
esses         0.81
ess           0.77
hattan        0.76
oid           0.72
Activations Density: 0.018%