INDEX
Explanations
instances of threats and negative commentary related to gender dynamics
Negative Logits
ê²ģ                 -0.14
imi                 -0.14
umni                -0.14
.showError          -0.14
imos                -0.13
ActivityIndicator   -0.13
upe                 -0.13
ALAR                -0.13
tape                -0.13
_STRIP              -0.13
Positive Logits
trolls              0.36
trolling            0.35
troll               0.32
hat                 0.29
cyber               0.29
vit                 0.28
Troll               0.27
mean                0.27
online              0.26
keyboard            0.24
Activations Density 0.045%