INDEX
Explanations
conversational exchanges that reflect opinions and thoughts
Negative Logits
simply    -0.17
my        -0.16
only      -0.16
:         -0.16
simple    -0.15
will      -0.14
should    -0.14
cannot    -0.14
uck       -0.14
the       -0.13
Positive Logits
yourselves    0.33
yourself      0.30
your          0.24
你的          0.24
youre         0.23
Yourself      0.22
your          0.22
YOUR          0.20
)?↵           0.20
ваш           0.19
Activations Density 0.253%