INDEX
Explanations
phrases that refer to narratives or stories
Negative Logits
shire     -0.16
edList    -0.16
lund      -0.16
est       -0.15
lah       -0.15
zell      -0.15
emple     -0.15
estar     -0.15
lander    -0.15
wich      -0.15
Positive Logits
tale      0.22
tales     0.21
Tale      0.20
told      0.19
/story    0.17
ith       0.17
スレ      0.16
-utils    0.15
tras      0.15
milano    0.15
Activations Density 0.014%