Explanations
mentions of trolling behavior or trolls in online interactions
Negative Logits
++++++++++++++++    -0.48
iyah                -0.46
kosher              -0.46
erald               -0.45
preservation        -0.45
riott               -0.45
hani                -0.45
enture              -0.45
âĢ¢âĢ¢âĢ¢âĢ¢        -0.45
clusive             -0.44
Positive Logits
trolls              0.63
troll               0.62
ãĥĦ                 0.61
hattan              0.57
bag                 0.57
tro                 0.56
bags                0.55
trolling            0.54
Troll               0.52
boxes               0.49
Activations Density 10.973%
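
Two of the listed tokens, `âĢ¢âĢ¢âĢ¢âĢ¢` and `ãĥĦ`, look like raw vocabulary strings from a byte-level BPE tokenizer rather than readable text. The sketch below assumes the underlying model uses the GPT-2 style byte-to-unicode mapping (the page does not say which tokenizer is in use), and shows how such strings could be mapped back to the characters they stand for. The function names are illustrative and not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style table mapping each byte (0-255) to a printable character.

    Assumption: the model behind this dashboard uses this byte-level BPE scheme.
    """
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def decode_token(token):
    """Turn a raw vocabulary string back into readable UTF-8 text."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    # Tokens can split a multi-byte character, so tolerate invalid sequences.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["âĢ¢âĢ¢âĢ¢âĢ¢", "ãĥĦ", "trolls"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, `âĢ¢âĢ¢âĢ¢âĢ¢` decodes to four bullet characters (`••••`) and `ãĥĦ` to the katakana `ツ`, while plain ASCII tokens such as `trolls` come back unchanged.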