Explanations
terms related to adversaries or opponents
references to adversarial or opposing groups
Negative Logits
eret        -0.74
ascript     -0.74
authorized  -0.73
oled        -0.73
printed     -0.72
razil       -0.71
Ħ¢          -0.70
estone      -0.69
otto        -0.69
arrell      -0.68
Positive Logits
Enemy       1.02
foe         0.91
enemy       0.86
combatants  0.82
vanquished  0.77
hated       0.75
legion      0.75
Enemies     0.74
emies       0.74
waging      0.73
Activations Density 0.025%