Explanations
defining concepts or topics
Negative Logits
(          0.44
We         0.44
we         0.44
więc       0.43
dessa      0.42
podemos    0.40
vardır     0.40
=          0.39
puedes     0.39
you        0.39
Positive Logits
amid       0.59
.«         0.58
.”         0.50
.''        0.49
."         0.49
.`         0.48
despite    0.46
.“         0.46
".         0.44
.’         0.43
Activations Density: 0.064%