Explanations
words related to brutality and violence
Negative Logits
Token          Logit
leaf           -0.76
conservancy    -0.74
Recommend      -0.72
uve            -0.72
ource          -0.71
clips          -0.71
zyme           -0.69
BU             -0.68
peak           -0.68
ilk            -0.68
Positive Logits
Token          Logit
ized            1.06
assault         0.98
izing           0.97
assaults        0.93
murders         0.92
murdering       0.87
ization         0.86
punishments     0.86
ised            0.86
repression      0.86
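To make the two logit lists concrete, here is a minimal sketch of how such lists are commonly produced. It assumes, without confirmation from this page, that each token's logit value is the dot product of the feature's decoder direction with that token's unembedding vector, and that the lists are simply the k most negative and k most positive of those scores. The function name `top_bottom_logits`, the toy vocabulary, and all vectors are hypothetical illustrations, not the dashboard's actual data or code.

```python
from typing import Dict, List, Tuple


def top_bottom_logits(
    decoder_dir: List[float],
    unembed: Dict[str, List[float]],
    k: int,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (top-k positive, top-k negative) tokens by direct logit effect.

    Hypothetical: a token's logit effect is the dot product of the
    feature's decoder direction with that token's unembedding vector.
    """
    scores = {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }
    # Sort ascending so the most negative scores come first.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative


# Toy 2-dimensional example; these vectors are invented for illustration.
unembed = {
    "assault": [1.0, 0.0],
    "murders": [0.9, 0.1],
    "leaf": [-0.8, 0.0],
    "clips": [-0.5, 0.2],
}
decoder_dir = [1.0, 0.0]
pos, neg = top_bottom_logits(decoder_dir, unembed, k=2)
print(pos)
print(neg)
```

Under this reading, the positive list shows which output tokens the feature directly promotes and the negative list which it suppresses, which is why violent words cluster on the positive side.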
Activations Density 0.033%
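A minimal sketch of one plausible reading of the density figure, assuming (not confirmed by this page) that activation density is the percentage of sampled tokens on which the feature's activation is strictly positive. Under that assumption, 0.033% corresponds to about 33 firing tokens per 100,000. The helper `activation_density` and the sample values are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold`.

    Hypothetical helper: assumes a token "activates" the feature when its
    activation is strictly greater than the threshold (0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Toy sample: 33 firing tokens out of 100,000 (invented values).
acts = [0.0] * 100_000
for i in range(33):
    acts[i * 3000] = 1.5
print(f"{activation_density(acts):.3f}%")
```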