Explanations
mentions of bullying and related behaviors
Negative Logits
Token     Logit
rahim     -0.18
boom      -0.15
adio      -0.15
окол      -0.14
pong      -0.14
thất      -0.14
ocaly     -0.13
Baby      -0.13
obi       -0.13
zig       -0.13
Positive Logits
Token     Logit
bull      0.63
Bul       0.59
bul       0.56
Bull      0.54
bullying  0.52
bully     0.51
bull      0.47
bul       0.45
bullied   0.43
cyber     0.41
Activations Density: 0.038%
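
To give a sense of scale for the density figure, here is a minimal sketch that assumes "activations density" means the fraction of token positions on which the feature fires (activation above zero). That definition, the function name activation_density, and the sample data are illustrative assumptions, not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: density = (number of active positions) / (total positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 38 of which fire.
sample = [1.0] * 38 + [0.0] * (100_000 - 38)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that reading, a density of 0.038% corresponds to roughly 38 active positions per 100,000 tokens, which fits a feature that fires only on a narrow topic.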