Explanations
references to violence and its consequences
Negative Logits
orne            -0.16
onya            -0.15
RAP             -0.15
esty            -0.15
autocomplete    -0.15
otos            -0.14
bilir           -0.14
anus            -0.14
_motion         -0.14
ancel           -0.14
Positive Logits
violence        0.45
Violence        0.36
violent         0.34
viol            0.33
-viol           0.31
resort          0.29
Viol            0.28
physical        0.26
violent         0.26
viol            0.24
Activations Density: 0.245%