INDEX
Explanations
Twitter handles and dates in a specific format
References to news outlets and media organizations
Negative Logits
Draper       -0.86
Goldberg     -0.79
Painter      -0.76
Faust        -0.76
Griffin      -0.76
captcha      -0.76
Professor    -0.75
Yose         -0.75
Morty        -0.74
Hollow       -0.72
Positive Logits
Official     1.07
HQ           1.04
official     1.00
news         0.94
_            0.91
gov          0.90
euro         0.88
podcast      0.86
Europe       0.86
network      0.86
Activations Density 0.095%