INDEX
Explanations
Twitter usernames
underscore characters in usernames or handles
Negative Logits
screenings    -0.70
eteria        -0.69
Manson        -0.68
pload         -0.67
ric           -0.67
aud           -0.66
Turing        -0.66
repent        -0.64
Casey         -0.63
chlorine      -0.63
Positive Logits
ebook         1.23
chance        1.03
tro           0.98
dust          0.97
must          0.93
vs            0.92
blank         0.91
pill          0.91
main          0.90
dict          0.89
Activations Density 0.020%