INDEX
Explanations
Twitter handles and social media usernames with specific letter combinations
usernames and handles related to social media
Negative Logits
Token         Logit
Scheme        -0.94
ACTIONS       -0.91
Reserve       -0.82
Ninth         -0.82
Direction     -0.81
Orient        -0.80
Index         -0.79
Enforcement   -0.79
Hearts        -0.77
Inspection    -0.76
Positive Logits
Token         Logit
yp            1.14
podcast       1.05
mc            0.99
_             0.98
raham         0.96
idth          0.95
fd            0.94
tv            0.94
sth           0.93
olson         0.93
Activations Density: 0.138%