INDEX
Explanations
actions related to physical violence
instances of physical violence and related actions
Negative Logits
lance       -0.78
invoke      -0.66
faithfully  -0.62
uras        -0.62
ellig       -0.62
UID         -0.61
erning      -0.61
akings      -0.61
IRE         -0.60
subpoen     -0.60
Positive Logits
stret        0.96
floor        0.94
unconscious  0.91
ground       0.88
bushes       0.87
pavement     0.86
couch        0.86
concrete     0.84
bed          0.84
sofa         0.83
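The page does not say how the two logit lists are computed. One common approach, assumed here, is to project the feature's decoder direction through the model's unembedding matrix (W_U) and rank vocabulary tokens by the resulting score. The minimal sketch below does that in pure Python. The function names `logit_effects` and `top_and_bottom` are hypothetical, and the toy vectors are invented for illustration, not taken from this feature.

```python
def logit_effects(decoder_dir, unembed):
    """Dot the feature's decoder direction with each token's unembedding column.

    decoder_dir: list of floats of length d_model (hypothetical toy values).
    unembed: dict mapping token -> its W_U column (list of floats, length d_model).
    Returns a dict mapping token -> logit effect.
    """
    return {
        tok: sum(d * w for d, w in zip(decoder_dir, col))
        for tok, col in unembed.items()
    }


def top_and_bottom(effects, k):
    """Return the k most negative and k most positive (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Toy example with a 3-dimensional residual stream (values are made up).
decoder_dir = [1.0, 0.0, -0.5]
unembed = {
    "floor": [0.9, 0.1, 0.0],
    "ground": [0.8, 0.0, 0.0],
    "invoke": [-0.2, 0.3, 1.0],
    "lance": [-0.5, 0.0, 0.6],
}
effects = logit_effects(decoder_dir, unembed)
negative, positive = top_and_bottom(effects, 2)
print("Negative:", negative)
print("Positive:", positive)
```

Under this assumption, a token such as "floor" scores high because its unembedding column points in roughly the same direction as the feature's decoder vector, so the feature pushes the model toward predicting it.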
Activations Density 0.181%
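Activation density is usually read as the fraction of tokens on which the feature fires. A density of 0.181% therefore means roughly 181 firing tokens per 100,000. The sketch below assumes "fires" means an activation strictly above zero. The dashboard's exact threshold is not stated, and the `activation_density` helper is hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`.

    Assumes a token counts as "firing" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical data: 181 firing tokens out of 100,000.
acts = [0.0] * (100_000 - 181) + [1.0] * 181
density = activation_density(acts)
print(f"{density:.3%}")
```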