INDEX
Explanations
mentions of sexual activities or behaviors
references to sexual misconduct and assault
NEGATIVE LOGITS
gery                 -0.85
quickShipAvailable   -0.75
gets                 -0.72
univers              -0.71
batch                -0.67
arity                -0.66
beginnings           -0.65
realization          -0.64
journalism           -0.64
ger                  -0.64
POSITIVE LOGITS
assaulted            1.13
abused               1.08
harassed             1.02
charged              1.01
abusing              0.99
transmitted          0.96
exploited            0.93
assaulting           0.91
harmed               0.88
stimulated           0.86
Activations Density 0.030%
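
The page does not define the density figure. A common reading, assumed here, is the fraction of sampled token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires; the source page does not state its definition.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 3 of 10,000 token positions.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.030% corresponds to roughly 3 firing positions per 10,000 tokens, which marks a sparse feature.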