Explanations
phrases related to legendary figures or stories
Negative Logits
er        -0.23
eron      -0.19
alus      -0.18
age       -0.16
comb      -0.16
going     -0.15
rage      -0.15
ÑģÑı      -0.15
apas      -0.15
eres      -0.14
Positive Logits
ARY       0.22
naire     0.22
äre       0.21
naires    0.21
ary       0.20
itimate   0.20
ario      0.19
airy      0.19
SHIP      0.18
imized    0.18
Activations Density: 0.016%