INDEX
Explanations
phrases related to online behavior, social activism, and specific names or terms
Negative Logits
naire       -0.79
eous        -0.78
bunk        -0.76
BOX         -0.69
Opera       -0.68
Murd        -0.68
culosis     -0.68
Kaepernick  -0.68
IFIED       -0.67
Robbins     -0.67
Positive Logits
aviour      1.61
avior       1.15
reath       1.02
beh         0.96
abus        0.92
cipl        0.91
assing      0.91
anging      0.90
avin        0.89
olic        0.89
Activations Density: 7.222%