INDEX
Explanations
periods indicating sentence boundaries
end of sentence connectors
Negative Logits
Token              Value
IntoConstraints    -0.97
enderror           -0.92
𑄮                  -0.91
Dieſe              -0.90
<unused79>         -0.90
[@BOS@]            -0.90
<pad>              -0.90
<unused16>         -0.90
<unused8>          -0.90
<unused6>          -0.90
Positive Logits
Token              Value
,                  0.29
/                  0.27
only               0.27
not                0.26
i                  0.26
-                  0.25
in                 0.24
significantly      0.24
(not shown)        0.23
(                  0.23
Activations Density 0.008%
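
To put the density figure in concrete terms, here is a minimal sketch. It assumes that "activations density" means the fraction of token positions on which the feature's activation is above zero; the page does not state its definition, so treat this as one plausible reading. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and only serve the illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (active positions) / (all positions). The dashboard
    may use a different definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 8 active positions out of 100,000 tokens.
sample = [0.0] * 99_992 + [1.5] * 8
density = activation_density(sample)

# Format as a percentage with three decimal places.
print(f"{density:.3%}")
```

Under that reading, a density of 0.008% means the feature fires on roughly 8 of every 100,000 tokens.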