INDEX
Explanations
references to letters and letter-writing
Negative Logits
yan       -0.17
yum       -0.17
yor       -0.16
onet      -0.16
andle     -0.15
emaker    -0.15
yon       -0.15
zdy       -0.15
vier      -0.15
slu       -0.15
Positive Logits
press      0.28
head       0.24
atura      0.21
ed         0.20
ing        0.20
-spacing   0.19
addressed  0.19
ewe        0.18
heads      0.17
opener     0.17
Activations Density 0.026%
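
For readers unfamiliar with the density figure, the following is a minimal sketch of one common way to read it: the fraction of token positions on which the feature fires at all. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen for illustration only; the dashboard's own computation may differ.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    values = list(activations)
    if not values:
        return 0.0
    active = sum(1 for a in values if a > threshold)
    return active / len(values)


# Hypothetical sample: 100,000 token positions, 26 of them active.
sample = [0.0] * 99_974 + [1.5] * 26
density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

Under this reading, a density of 0.026% means the feature is active on roughly 26 of every 100,000 tokens, which is consistent with a sparse, narrowly scoped feature.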