Explanations
concepts related to bullying and harassment
Negative Logits
Token         Logit
oppel         -0.17
adio          -0.16
æ§            -0.14
tet           -0.14
hypocrisy     -0.14
amine         -0.14
defe          -0.13
rij           -0.13
POSE          -0.13
820           -0.13
Positive Logits
Token         Logit
bullying       0.29
bul            0.27
bully          0.26
Bul            0.24
harassment     0.23
bullied        0.23
harass         0.22
victim         0.21
victim         0.20
teasing        0.19
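The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. Below is a minimal sketch of how such lists are commonly produced, assuming (the page does not say so) that each value is the dot product of the feature's decoder direction with the matching column of the model's unembedding matrix. The names `decoder_dir`, `unembed`, `logit_effects`, and `top_and_bottom` are hypothetical, and the toy numbers are made up for illustration.

```python
def logit_effects(decoder_dir, unembed):
    """Project a (hypothetical) SAE decoder direction through the unembedding.

    decoder_dir: list of d_model floats.
    unembed: d_model rows, each a list of vocab_size floats (W_U).
    Returns one logit contribution per vocabulary token.
    """
    if len(decoder_dir) != len(unembed):
        raise ValueError("decoder_dir and unembed must share d_model")
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k=10):
    """Return the k most negative and k most positive (token, value) pairs."""
    order = sorted(range(len(effects)), key=lambda t: effects[t])
    negative = [(vocab[t], effects[t]) for t in order[:k]]
    positive = [(vocab[t], effects[t]) for t in reversed(order[-k:])]
    return negative, positive


# Toy example: d_model = 2, vocabulary of 3 tokens.
vocab = ["bullying", "hypocrisy", "the"]
decoder_dir = [1.0, 0.0]
unembed = [[0.3, -0.1, 0.0], [5.0, 5.0, 5.0]]
negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed), vocab, k=1)
print(negative)  # [('hypocrisy', -0.1)]
print(positive)  # [('bullying', 0.3)]
```

Under this reading, several near-duplicate tokens (such as the two `victim` entries, which likely differ only in leading whitespace) can each appear because every vocabulary entry is scored independently.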
Activation Density: 0.038%
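Activation density is the share of token positions on which the feature fires; at 0.038%, that is roughly 38 of every 100,000 tokens. A minimal sketch, assuming "fires" means an activation strictly above zero (the page does not state the threshold) and that activations arrive as one list per sequence; `activation_density` is a hypothetical helper and the toy data is invented.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    activations: iterable of sequences, each an iterable of floats
    (hypothetical per-token feature activations).
    """
    total = 0
    active = 0
    for sequence in activations:
        for value in sequence:
            total += 1
            if value > threshold:
                active += 1
    # Guard against an empty dataset instead of dividing by zero.
    return active / total if total else 0.0


# Toy data: one sequence of 100,000 tokens, of which 38 are active.
toy = [[0.0] * 99_962 + [1.5] * 38]
density = activation_density(toy)
print(f"{density * 100:.3f}%")  # 0.038%
```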