INDEX
Explanations
punctuation marks, specifically periods
Negative Logits
Token         Logit
.             -0.17
akens         -0.16
hsi           -0.15
[             -0.15
commitment    -0.15
*             -0.15
itu           -0.14
zh            -0.14
it            -0.14
ul            -0.14
Positive Logits
Token         Logit
5             0.26
8             0.24
75            0.23
7             0.22
6             0.21
9             0.21
0             0.20
4             0.19
3             0.18
85            0.18
Activations Density 0.083%