INDEX
Explanations
mentions of physical violence, specifically instances of being physically attacked or harmed
instances of physical violence or assault
Negative Logits
orrow      -0.89
isse       -0.75
gravity    -0.74
oplan      -0.73
ortium     -0.72
facult     -0.72
FF         -0.70
alg        -0.67
ordan      -0.67
rover      -0.67
Positive Logits
beaten     1.09
beat       0.99
beating    0.86
¶æ         0.82
boxing     0.80
down       0.79
soever     0.77
Beat       0.76
Beat       0.75
beat       0.75
Activations Density 0.018%