INDEX
Explanations
instances of people being insulted or making insulting comments
phrases related to social media interactions and conflicts
Negative Logits
Token     Logit
Zucker    -0.96
Zub       -0.88
dives     -0.85
suc       -0.81
SQ        -0.78
bles      -0.78
199       -0.76
Perez     -0.73
kittens   -0.73
dive      -0.72
Positive Logits
Token       Logit
Arm         2.19
Arm         2.12
arm         2.08
ARM         2.03
ARM         1.83
arm         1.75
arms        1.45
Armory      1.38
Armstrong   1.38
Arms        1.36
Activations Density: 0.233%