INDEX
Explanations
negative phrases or words that highlight contradictions or issues
Negative Logits
,       -0.34
's      -0.23
-       -0.21
`s      -0.19
,↵      -0.19
–       -0.18
&apos   -0.18
’s      -0.17
?s      -0.17
�s      -0.17
Positive Logits
are     0.18
)ìĿĢ    0.14
came    0.14
were    0.14
has     0.14
was     0.14
leta    0.13
ÈĻi     0.13
gle     0.13
ogn     0.13
Activations Density 0.297%