Explanations
email addresses or usernames on social media platforms
mentions of social media handles
Negative Logits
accordingly     -0.69
Strikes         -0.68
fitting         -0.66
Takeru          -0.65
hence           -0.64
reintrodu       -0.63
Pumpkin         -0.63
humiliating     -0.63
Supplemental    -0.62
ITION           -0.62
Positive Logits
gmail           1.69
yahoo           1.45
hot             1.10
debian          1.10
bleacher        1.04
earth           1.04
home            1.00
mac             0.97
hillary         0.97
domain          0.96
Activations Density: 0.017%