Explanations
terms related to adversarial entities or threats
Negative Logits
erſt          -0.73
dieſer        -0.72
verſch        -0.72
müſſen        -0.71
wiſſen        -0.70
unſer         -0.70
ſeinem        -0.69
ſelbſt        -0.69
ſans          -0.69
ſeinen        -0.69
Positive Logits
enemy         1.33
opponent      1.21
opponents     1.20
enemy         1.08
enemies       1.07
Enemy         1.07
musuh         1.05
adversaries   1.02
Enemy         1.00
enemigo       0.98
Activations Density 0.277%
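
Token lists like the positive and negative logits above are commonly obtained by projecting a feature's decoder direction through the model's unembedding matrix and sorting the resulting per-token scores. The sketch below illustrates that idea on toy data. It assumes this is how the scores were produced, which the dashboard itself does not state, and every name, token, and number in it (`logit_effects`, `top_k`, the vocabulary, the weights) is hypothetical.

```python
# Toy sketch: rank vocabulary tokens by a feature's direct effect on output logits.
# Assumption: score(token) = dot(unembedding row for token, feature decoder direction).
# All names and values are illustrative and are not taken from the dashboard.

def logit_effects(decoder_dir, unembed, vocab):
    """Map each token to the dot product of its unembedding row and the decoder direction."""
    return {
        tok: sum(w * d for w, d in zip(row, decoder_dir))
        for tok, row in zip(vocab, unembed)
    }

def top_k(effects, k, positive=True):
    """Return the k tokens with the most positive (or most negative) scores."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]

# Hypothetical 4-token vocabulary with 2-dimensional unembedding rows.
vocab = ["enemy", "opponent", "the", "ally"]
unembed = [
    [1.0, 0.5],
    [0.5, 1.0],
    [0.0, 0.0],
    [-1.0, 0.0],
]
decoder_dir = [1.0, 0.5]  # hypothetical feature decoder direction

effects = logit_effects(decoder_dir, unembed, vocab)
positive = top_k(effects, 2, positive=True)
negative = top_k(effects, 2, positive=False)

for tok, score in positive:
    print(f"{tok:<12}{score:.2f}")
```

In a real model the vocabulary has tens of thousands of entries and the same surface string can appear more than once with different casing or leading whitespace, which would explain repeated entries such as `enemy` and `Enemy` in the list above.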