INDEX
Explanations
references to specific experimental details and clarifications
Negative Logits
στα             -0.36
occasionally    -0.31
ようになります   -0.27
m               -0.26
.               -0.25
tròn            -0.24
alami           -0.24
↵               -0.24
↵↵              -0.24
temporarily     -0.24
Positive Logits
ſelbſt          0.88
ſind            0.87
<unused8>       0.86
<unused41>      0.85
<unused79>      0.85
<unused14>      0.85
<unused52>      0.85
<unused68>      0.85
[@BOS@]         0.85
<unused16>      0.85
Activations Density: 0.719%