INDEX
Explanations
links to Twitter posts
URLs or web links in the text
Negative Logits
boun            -0.67
notices         -0.63
Fortune         -0.62
appeals         -0.61
decor           -0.61
cler            -0.61
memos           -0.61
successfully    -0.60
contemplation   -0.59
franchise       -0.59
Positive Logits
dL              1.14
Gh              1.13
dk              1.12
CN              1.10
Hu              1.09
OX              1.07
nv              1.07
oa              1.07
bf              1.06
dp              1.06
Activations Density: 0.016%