INDEX
Explanations
terms related to violence or aggression
Negative Logits
leaf        -0.82
FU          -0.72
ploma       -0.71
BU          -0.70
cript       -0.70
Unlimited   -0.69
hner        -0.69
ource       -0.69
OPLE        -0.68
Script      -0.68
Positive Logits
ized          0.97
assault       0.92
assaults      0.87
ified         0.86
retribution   0.85
killers       0.84
beasts        0.83
punishments   0.83
murdering     0.82
izing         0.82
Activations Density: 0.021%