Explanations
descriptions related to violent or harmful actions
instances of the word "brutal" in contexts related to violence or suffering
Negative Logits
leaf          -0.87
ploma         -0.77
ource         -0.77
cript         -0.76
verage        -0.75
OPLE          -0.74
clips         -0.74
BU            -0.71
arten         -0.71
Recommend     -0.71
Positive Logits
assault       1.03
ized          1.01
assaults      0.97
murders       0.95
torture       0.93
izing         0.92
punishments   0.91
beasts        0.89
murder        0.85
retribution   0.84
Activations Density 0.040%