INDEX
Explanations
violence-related phrases involving physical harm and law enforcement
references to violent incidents or fatalities
Negative Logits
retty            -0.55
awaru            -0.52
udos             -0.51
Vaugh            -0.50
conclud          -0.50
soDeliveryDate   -0.49
incorpor         -0.49
furthermore      -0.49
however          -0.48
moreover         -0.48
Positive Logits
)?               0.64
?",              0.59
\'               0.53
apor             0.52
])               0.52
)|               0.51
Ħ¢               0.50
)]               0.49
their            0.49
?),              0.49
Activations Density 2.041%