INDEX
Explanations
names and user profiles
presence of the end-of-text token
Negative Logits
Instr
-0.82
Seym
-0.66
Canary
-0.65
Pound
-0.63
Niet
-0.62
Channel
-0.62
Ninth
-0.60
Moroc
-0.60
[*
-0.60
Kik
-0.60
POSITIVE LOGITS
":{"
0.77
@
0.73
lement
0.72
_
0.68
Profile
0.66
mosp
0.66
llular
0.66
roid
0.65
podcast
0.63
ocious
0.63
Activations Density 0.182%