INDEX
Explanations
also, followed by words
followed by question words
Negative Logits
ig    0.86
ib    0.84
로    0.81
ла    0.79
ri    0.77
پ    0.77
ро    0.75
ون    0.75
ти    0.72
ur    0.71
Positive Logits
is    0.92
      0.91
它    0.82
be    0.77
was   0.76
an    0.75
it    0.71
on    0.70
of    0.67
è     0.67
Activations Density 0.373%