INDEX
Explanations
okay, followed by introductory phrases
Negative Logits
Token    Value
us       0.89
.        0.83
(        0.73
ig       0.67
8        0.63
5        0.62
ovaný    0.61
7        0.60
ung      0.58
gl       0.58
Positive Logits
Token    Value
m        0.69
it       0.67
d        0.67
ለያዩ      0.66
to       0.63
s        0.63
and      0.62
be       0.61
の       0.60
dır      0.60
Activations Density: 0.351%