INDEX
Explanations
questions posed in the text
Negative Logits
entar       -0.16
errar       -0.15
ented       -0.15
ent         -0.14
iller       -0.14
ifter       -0.14
ublisher    -0.14
omor        -0.14
anity       -0.13
ãn          -0.13
Positive Logits
better      0.31
better      0.23
mejor       0.21
else        0.20
could       0.19
Better      0.18
more        0.17
Better      0.17
do          0.17
could       0.16
Activations Density: 0.043%