INDEX
Explanations
direct instructions or actions
Negative Logits
is      1.59
{       1.29
as      1.11
has     1.11
.       1.10
د       1.09
einem   1.02
}       0.99
zeigen  0.98
=       0.96
Positive Logits
the     1.41
n       1.33
is      1.27
יות     1.23
on      1.20
ing     1.13
as      1.06
in      1.04
it      1.02
a       1.01
Activations Density 0.086%