INDEX
Explanations
Twitter handles to follow
instances of the word "Follow" indicating social media references
Negative Logits
pite      -0.84
inese     -0.73
ILCS      -0.71
negie     -0.67
pressed   -0.66
cit       -0.66
unicip    -0.66
rouse     -0.65
ately     -0.64
dfx       -0.64
Positive Logits
Follow    0.92
cies      0.84
ers       0.82
ership    0.81
Follow    0.78
@         0.77
Updates   0.75
ed        0.72
follow    0.69
ï¸ı       0.69
Activations Density: 0.020%