Explanations
various numeric patterns or representations
Negative Logits
Token   Logit
ness    -0.81
ers     -0.76
↵↵      -0.75
—       -0.75
<sup>   -0.70
er      -0.70
an      -0.69
ja      -0.67
<h2>    -0.67
en      -0.66
Positive Logits
Token   Logit
}}$}    1.40
"}      1.38
"]}     1.36
'}      1.31
']}     1.30
")}     1.29
]")]    1.27
"}      1.22
).}     1.12
')}     1.11
Activations Density: 0.288%