INDEX
Explanations
references to the act of reading
Negative Logits
ask     -0.17
iv      -0.16
ck      -0.16
d       -0.16
oc      -0.15
udad    -0.15
sten    -0.15
ated    -0.15
use     -0.15
ped     -0.15
POSITIVE LOGITS
just             0.24
/list            0.23
/view            0.23
mitted           0.22
comprehension    0.21
/watch           0.20
åıĸ              0.20
/write           0.19
ied              0.19
iness            0.18
Activations Density 0.072%