Explanations
part of word endings
the apostrophe character in English contractions.
This neuron detects mentions of large language models and related training processes.
Negative Logits
(blank)    0.47
is         0.43
a          0.42
(          0.35
,          0.31
{          0.30
it         0.29
to         0.29
،          0.29
of         0.28
Positive Logits
and        0.45
на         0.45
ون         0.44
z          0.40
u          0.38
f          0.37
in         0.36
ل          0.36
b          0.35
d          0.34
Activations Density: 16.856%