INDEX
Explanations
mentions of adversaries or foes
references to adversarial characters or entities
Negative Logits
lic       -0.83
eret      -0.79
otide     -0.77
auntlets  -0.74
ced       -0.73
cer       -0.70
otto      -0.70
oled      -0.69
Shot      -0.69
UTH       -0.69
Positive Logits
enemies      1.24
foe          1.15
enemy        1.12
adversaries  1.11
Enemies      1.07
foes         1.05
Enemy        0.99
undermin     0.96
adversary    0.91
emies        0.89
Activations Density 0.010%