INDEX
Explanations
proper nouns, particularly names and organizations
Negative Logits
Token    Logit
ephir    -0.16
_slow    -0.15
earch    -0.15
Slow     -0.15
emark    -0.15
ashed    -0.14
æŀĿ      -0.14
मन       -0.14
lub      -0.14
opis     -0.14
Positive Logits
Token    Logit
ki       0.24
en       0.23
ky       0.22
ka       0.18
song     0.17
ons      0.17
yah      0.17
ens      0.16
zc       0.15
hte      0.15
Activations Density: 0.065%