INDEX
Explanations
mentions of attacks or aggressive actions
terms associated with aggressive discourse or confrontational communication
Negative Logits
Token          Logit
ITNESS         -0.73
Suc            -0.68
Norn           -0.66
significant    -0.66
Wonders        -0.66
orderly        -0.65
transitions    -0.65
Alive          -0.65
existent       -0.63
Transform      -0.63
Positive Logits
Token          Logit
leveled        1.18
accusing       1.18
tir            1.17
against        1.13
hurled         1.11
levied         1.07
against        1.06
slurs          1.02
denounce       1.00
denouncing     0.99
Activations Density 0.225%