Explanations
proper nouns, particularly names and titles
Negative Logits
awan      -0.15
anke      -0.15
arness    -0.15
oš        -0.15
paren     -0.14
osy       -0.14
olio      -0.14
elon      -0.14
ãĤ        -0.14
folio     -0.14
Positive Logits
aiser     0.16
zcze      0.15
ober      0.15
ãĥ¼ãĥª    0.14
atorial   0.14
itant     0.14
ZH        0.14
WSTR      0.14
darwin    0.14
ÏĦιÏĥ     0.14
Activations Density 0.075%