INDEX
Explanations
descriptions of physical violence or assault
references to instances of physical violence or abuse
Negative Logits
facult      -0.87
export      -0.81
orrow       -0.79
odor        -0.77
entric      -0.76
gravity     -0.74
osion       -0.74
aird        -0.73
orescent    -0.70
ateral      -0.70
Positive Logits
beaten      1.32
beat        1.20
Beat        1.02
beating     0.99
beat        0.91
¶æ          0.88
Beat        0.88
heet        0.86
boxing      0.81
against     0.74
Activations Density: 0.011%