Negative Logits
I  0.92
3  0.92
4  0.91
6  0.84
٣  0.84
IE  0.82
5  0.79
ים  0.79
้  0.78
Q  0.77
Positive Logits
(no visible token)  1.21
to  0.91
y  0.82
zelfde  0.80
an  0.72
that  0.69
то  0.69
is  0.68
it  0.67
to  0.67
Activations Density 0.879%