INDEX
Explanations
the concept of neutrality or neutral states
Negative Logits
<eos>       -0.58
(           -0.54
esclavos    -0.52
nd          -0.51
…           -0.51
convective  -0.51
povo        -0.50
ว           -0.50
↵           -0.50
{           -0.50
Positive Logits
neutral     1.54
Neutral     1.47
UTRAL       1.45
neutral     1.44
Neutral     1.38
neutre      1.34
нейтра      1.30
neutrals    1.27
hurt        1.25
neutrality  1.25
Activations Density 0.108%