INDEX
Explanations
expressions of desire or intention
Negative Logits
ạm            -0.17
yourselves    -0.16
Yourself      -0.16
udas          -0.14
ÑĢÑİ          -0.14
zent          -0.14
med           -0.14
inen          -0.14
hy            -0.14
rint          -0.14
Positive Logits
to            0.24
nothing       0.21
us            0.21
entially      0.21
only          0.18
them          0.17
να            0.17
/ne           0.16
feedback      0.16
/             0.16
Activations Density: 0.070%