Explanations
discussions surrounding violence and its justification
Negative Logits
Token    Logit
ajas     -0.15
hausen   -0.14
Td       -0.14
endas    -0.14
_ENV     -0.13
onse     -0.13
dings    -0.13
uster    -0.13
_Tis     -0.13
stu      -0.13
Positive Logits
Token        Logit
lo           0.30
Loot         0.28
loot         0.25
riot         0.25
cur          0.24
Mayor        0.23
nightly      0.22
downtown     0.22
destructive  0.22
/lo          0.22
Activations Density: 0.010%
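
The source does not define how the density figure is computed. As a rough sketch, assuming it is the fraction of sampled tokens on which the feature has a nonzero activation, a value of 0.010% would correspond to about one firing token in every 10,000. The function name `activation_density` and the sample data below are hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float]) -> float:
    """Return the fraction of tokens with a nonzero activation.

    Assumption: "density" means the share of sampled tokens on which
    the feature fires at all. The source does not confirm this.
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    # Count tokens where the feature is active (activation != 0).
    active = sum(1 for a in activations if a != 0)
    return active / len(activations)


# Hypothetical sample: one active token out of 10,000.
sample = [0.0] * 9999 + [2.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this assumption, a density this low suggests a sparse feature that fires only in narrow contexts, which would be consistent with the riot and looting tokens that dominate the positive logits.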