INDEX
Explanations
phrases beginning with "I know"
instances of self-awareness or acknowledgment of knowledge
Negative Logits
issance      -0.79
pione        -0.78
onding       -0.77
exting       -0.76
orthy        -0.75
pmwiki       -0.69
ā            -0.66
ermanent     -0.66
þ            -0.66
Đ            -0.66
Positive Logits
firsthand    1.18
how          1.02
plenty       0.98
exactly      0.94
what         0.93
lots         0.90
anecd        0.89
why          0.89
many         0.89
alot         0.85
Activations Density: 0.057%