Explanations
references to adversarial entities or opponents
Negative Logits
pr        -0.59
Norr      -0.57
decía     -0.56
Against   -0.56
er        -0.55
Pr        -0.54
R         -0.52
At        -0.52
S         -0.51
est       -0.51
Positive Logits
Enemy     1.36
Enemy     1.16
Enemies   1.16
enemy     1.15
enemy     1.15
enemies   1.15
enemies   1.12
Enemies   1.07
ennemi    1.05
nemy      1.00
Activations Density: 0.006%