INDEX
Explanations
s followed by common words
Negative Logits
ла          0.66
ты          0.66
(           0.62
ти          0.61
ো           0.58
stockings   0.54
νει         0.52
sparsim     0.52
mishaps     0.51
ва          0.50
Positive Logits
is          0.86
d           0.72
was         0.72
de          0.66
an          0.66
que         0.64
à           0.64
are         0.63
ה           0.62
た          0.61
Activations Density: 1.150%