INDEX
Explanations
words related to negative social behavior or mistreatment of individuals
mentions of harassment and related behaviors
Negative Logits
Token      Logit
éĹĺ        -0.99
icts       -0.78
inet       -0.78
zyme       -0.75
ACTED      -0.73
chart      -0.72
arch       -0.71
lined      -0.71
essential  -0.70
shows      -0.70
Positive Logits
Token       Logit
harass      0.91
harassment  0.90
harassing   0.87
harassed    0.78
assment     0.76
stalking    0.75
accus       0.75
tactics     0.73
ãĥĨ         0.71
ingly       0.67
Activations Density 0.026%
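
Token strings such as éĹĺ and ãĥĨ in the logit lists look like the printable form that byte-level BPE vocabularies use for raw UTF-8 bytes. The source does not name the tokenizer, so the following is only a sketch under the assumption that it follows the GPT-2 byte-to-unicode convention; the function names are illustrative and not taken from any dashboard code.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table.

    Assumption: the tokenizer behind these tokens uses this convention.
    """
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a vocabulary string back to the text it stands for."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Partial multi-byte sequences decode to U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["éĹĺ", "ãĥĨ", "harass"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, ãĥĨ corresponds to the katakana テ and éĹĺ to the character 闘, while plain ASCII tokens such as harass are unchanged. If the dashboard's model uses a different tokenizer, the mapping does not apply.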