INDEX
Explanations
references to toys in the text
Negative Logits
naires   -0.16
iyon     -0.15
mers     -0.15
stag     -0.15
pheric   -0.15
stk      -0.15
nock     -0.15
wy       -0.14
eners    -0.14
anter    -0.14
Positive Logits
toy      0.19
toy      0.18
toys     0.16
acht     0.16
oh       0.16
Toy      0.15
Toy      0.15
nton     0.15
ama      0.14
iet      0.14
Activations Density 0.008%