INDEX
Explanations
mentions of attacks and aggressive actions
Negative Logits
{~          -0.73
թվական      -0.72
Ав          -0.69
nO          -0.68
км          -0.62
مرئيه       -0.62
setOnItem   -0.61
ുന്നു        -0.61
uros        -0.61
└           -0.61
Positive Logits
Attack      2.33
attack      2.30
ATTACK      2.17
attacks     2.16
attack      2.12
Attacks     2.07
Attack      2.01
Attacks     2.00
ATTACK      1.99
attacks     1.90
Activations Density: 0.047%