INDEX
Explanations
words related to negative behaviors or actions, especially harassment
instances of the term "harassment" and related contexts
Negative Logits
Token      Logit
éĹĺ        -0.91
stanbul    -0.79
rient      -0.77
zyme       -0.77
ACTED      -0.74
ECD        -0.73
iets       -0.71
inet       -0.69
arch       -0.68
kos        -0.68
Positive Logits
Token        Logit
harassment   1.00
harass       0.96
harassing    0.93
accus        0.89
stalking     0.82
tactics      0.81
harassed     0.80
assment      0.78
leveled      0.72
inflic       0.72
Activations Density 0.030%
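
A density figure like the one above is commonly read as the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only. It assumes density means "share of tokens with activation above a threshold", which this page does not state, and the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active. The source page does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data (not taken from the dashboard):
# 10,000 tokens, of which exactly 3 have a nonzero activation.
acts = [0.0] * 10_000
for i in (12, 345, 6789):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.030%"
```

Under this reading, a density of 0.030% corresponds to roughly 3 active tokens per 10,000 sampled, which fits a feature that responds to a narrow vocabulary such as "harassment" and its variants.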