INDEX
Explanations
mentions of violent or cruel acts
references to violence or severe harm
Negative Logits
Token       Logit
cript       -0.78
annis       -0.75
ploma       -0.73
kj          -0.73
verage      -0.72
leaf        -0.72
OPLE        -0.71
FU          -0.71
Libraries   -0.70
BU          -0.68
Positive Logits
Token         Logit
earthqu       0.91
ized          0.90
assault       0.85
beasts        0.85
killers       0.82
assaults      0.80
punishments   0.80
dictator      0.79
murdering     0.78
murders       0.78
Activations Density: 0.017%