INDEX
Explanations
"the" followed by specific nouns
the introduction of specific concepts
Negative Logits
Token         Logit
ﺍ             0.66
6             0.63
5             0.61
}=            0.59
}.            0.56
4             0.56
-             0.55
erhältlich    0.54
8             0.53
geheel        0.53
Positive Logits
Token         Logit
to            1.17
that          0.82
که            0.69
at            0.68
이            0.64
it            0.63
be            0.60
of            0.58
د             0.56
by            0.52
Activations Density: 0.592%