INDEX
Explanations
references to incidents of violence or hate crimes
Negative Logits
iceps        -0.14
illing       -0.14
compression  -0.13
******************************************************************************↵  -0.13
428          -0.13
woo          -0.13
oÄį          -0.13
оÑĩек        -0.13
Bout         -0.12
747          -0.12
Positive Logits
vandalism    0.43
vandal       0.41
graffiti     0.38
spray        0.33
gra          0.31
damage       0.30
Spray        0.29
ван          0.29
arson        0.28
Gra          0.28
Activations Density 0.057%