INDEX
Explanations
words expressing certainty or affirmation
Negative Logits
Token   Logit
rve     -0.17
éĥ¡     -0.16
_tF     -0.16
коп     -0.15
cook    -0.14
umbn    -0.14
ipi     -0.14
anca    -0.14
aja     -0.14
coop    -0.13
Positive Logits
Token    Logit
be       0.24
been     0.23
most     0.20
sprites  0.15
awa      0.15
AW       0.14
Been     0.14
ï¸ı      0.14
763      0.13
Been     0.13
Activations Density: 0.198%