INDEX
Explanations
mentions of aggressive behavior or aggression-related terms
terms related to aggressive behaviors and violence
Negative Logits
FORMATION    -0.76
zl           -0.74
verend       -0.74
HCR          -0.70
obook        -0.68
lev          -0.67
Bake         -0.67
ummer        -0.66
haul         -0.65
aver         -0.65
Positive Logits
aggression   0.89
against      0.85
aggress      0.82
towards      0.82
toward       0.79
posture      0.78
iveness      0.77
escalation   0.76
provocation  0.76
Agg          0.73
Activations Density 0.053%