Explanations
patterns of structure and syntax in programming code
Negative Logits
↵     -0.22
R     -0.20
M     -0.19
F     -0.18
u     -0.18
y     -0.18
el    -0.18
N     -0.18
↵     -0.18
B     -0.17
Positive Logits
0.20
orna
0.16
least
0.15
upy
0.15
mlx
0.14
caf
0.14
Least
0.14
and
0.14
least
0.14
hire
0.14
Activations Density 0.125%