Explanations
Twitter usernames
proper nouns, particularly names and usernames
Negative Logits
substitutes    -0.74
theless        -0.73
borne          -0.70
ACTIONS        -0.70
龍契士         -0.68
substituted    -0.65
resid          -0.65
cured          -0.62
press          -0.61
uncertain      -0.60
Positive Logits
uff            0.88
Magikarp       0.86
WithNo         0.85
Jr             0.84
trump          0.83
_.             0.83
Own            0.83
Whe            0.82
FT             0.82
td             0.81
Activations Density 0.079%
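
The density figure can be read as the share of token positions on which the feature fires at all. The sketch below assumes that reading (activation strictly above zero counts as firing). The function name `activation_density` and the sample data are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature
    is active, i.e. its activation is strictly greater than `threshold`.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 79 of which are active.
sample = [0.0] * (100_000 - 79) + [1.0] * 79
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.079%
```

Under this assumption, a density of 0.079% would mean the feature is active on roughly 79 out of every 100,000 tokens, which is typical of a sparse feature.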