INDEX
Explanations
common sentence starters
introduces explanations or examples
Negative Logits
ला    0.49
ra    0.47
h     0.47
’     0.46
نا    0.46
ap    0.45
ون    0.44
l     0.43
k     0.43
ri    0.42
Positive Logits
was   0.57
be    0.51
had   0.47
ة     0.45
of    0.44
with  0.44
avec  0.42
is    0.42
với   0.42
and   0.41
Activations Density: 1.640%
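
The page does not say how the density figure is computed. A common reading is the fraction of token positions on which the feature fires at all. The sketch below illustrates that reading only: the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 164 active positions out of 10,000 tokens.
sample = [0.0] * 9836 + [0.5] * 164
density = activation_density(sample)
print(f"Activations Density: {density:.3%}")
```

Under this definition a density of 1.640% would mean the feature is active on roughly 164 of every 10,000 tokens.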