Explanations
body parts, concepts, and behaviors
Negative Logits
a         1.10
(blank)   0.79
i         0.77
an        0.75
an        0.70
lN        0.63
ta        0.62
dana      0.61
k         0.59
as        0.59
Positive Logits
và         0.85
and        0.81
ни         0.76
are        0.75
у          0.75
και        0.74
ми         0.70
де         0.68
с          0.68
políticos  0.68
Activations Density 3.541%