Explanations
words related to endings and conclusions
Negative Logits
Token    Value
er       -0.23
eru      -0.20
erate    -0.19
quez     -0.18
erot     -0.18
ambre    -0.16
ETY      -0.16
cona     -0.16
erin     -0.16
oine     -0.15
Positive Logits
Token                Value
y                    0.23
ocrine               0.22
<|begin_of_text|>    0.20
ele                  0.19
rick                 0.19
eb                   0.19
ell                  0.18
eh                   0.18
all                  0.17
ahl                  0.17
Activations Density: 0.022%