Explanations
proper nouns, specifically names or titles
Negative Logits
arta      -0.17
ört       -0.16
optera    -0.15
787       -0.14
377       -0.14
aws       -0.14
plays     -0.14
pr        -0.14
ries      -0.14
plut      -0.14
Positive Logits
надлеж    0.18
.ib       0.16
žÃŃ       0.16
istan     0.15
eless     0.15
ifar      0.14
/ay       0.14
ož        0.14
oÄį       0.14
گار       0.14
Activations Density: 0.002%