INDEX
Explanations
references to attacks or aggressive actions
Negative Logits
{~        -0.76
Ав        -0.69
whole     -0.66
թվական    -0.64
nO        -0.64
gu        -0.63
ύ         -0.61
км        -0.61
RUNTIME   -0.61
Beans     -0.60
POSITIVE LOGITS
Attack    1.79
ATTACK    1.69
attack    1.66
attacks   1.61
Attacks   1.59
Attacks   1.56
ATTACK    1.55
attack    1.55
Attack    1.46
attacks   1.46
Activations Density 0.067%