Explanations
punctuation marks and sentence-ending symbols
Negative Logits
Logit   Token
-0.80
-0.67
-0.67   (
-0.64   K
-0.60   S
-0.55   -
-0.54   A
-0.53   -
-0.53   R
-0.52   P
Positive Logits
Logit   Token
1.90    .:
1.81    .-
1.67    ./
1.66    .):
1.64    .!
1.60    .–
1.58    .).
1.57    .);
1.56    .—
1.55    .;
Activations Density 0.211%
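
The explanation above ("punctuation marks and sentence-ending symbols") can be sanity-checked against the listed positive-logit tokens. The following is a minimal sketch, not part of the dashboard: the names `POSITIVE_LOGITS`, `is_punctuation_only`, and `punctuation_fraction` are hypothetical, and "punctuation" is assumed here to mean any character in a Unicode `P*` general category. The token/value pairs are copied from the positive-logit list.

```python
import unicodedata

# Top positive-logit tokens and their values, copied from the list above.
POSITIVE_LOGITS = [
    (".:", 1.90),
    (".-", 1.81),
    ("./", 1.67),
    (".):", 1.66),
    (".!", 1.64),
    (".–", 1.60),
    (".).", 1.58),
    (".);", 1.57),
    (".—", 1.56),
    (".;", 1.55),
]


def is_punctuation_only(token: str) -> bool:
    """Return True if the token is non-empty and every character
    falls in a Unicode punctuation category (Pc, Pd, Pe, Po, ...)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def punctuation_fraction(pairs) -> float:
    """Fraction of (token, logit) pairs whose token is punctuation-only."""
    return sum(is_punctuation_only(tok) for tok, _ in pairs) / len(pairs)


if __name__ == "__main__":
    frac = punctuation_fraction(POSITIVE_LOGITS)
    starts_with_period = all(tok.startswith(".") for tok, _ in POSITIVE_LOGITS)
    print(f"punctuation-only tokens: {frac:.0%}")
    print(f"all tokens start with '.': {starts_with_period}")
```

A check like this only covers the ten tokens shown; it says nothing about the contexts in which the feature activates, so it complements rather than replaces reading the activation examples.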