Explanations
References to sources, potentially citations or footnotes
Formatted text or structures within the document
New Auto-Interp
Negative Logits
oor                              -0.80
ãĤ© (ォ)                          -0.77
ãĥ¼ãĥ³ (ーン)                     -0.71
ient                             -0.71
æ© (incomplete character)        -0.69
ãĥ¼ãĥĨãĤ£ (ーティ)                -0.68
ãĤ¤ãĥĪ (イト)                     -0.68
çīĪ (版)                          -0.67
ãĤ£ (ィ)                          -0.67
é¾įå (龍 + incomplete character) -0.67
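The Japanese-looking entries above are byte-level BPE token strings: each byte of the token's UTF-8 encoding is displayed as one printable character. Below is a minimal Python sketch of the reverse mapping. It assumes the tokenizer uses GPT-2's byte-to-unicode table, which the characters Ĥ, ĥ and Ī (bytes 0x82, 0x83 and 0x88 in that table) suggest. `bytes_to_unicode` mirrors GPT-2's helper of the same name, and `decode_token` is a hypothetical name.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style table: printable bytes map to themselves, the rest to code points 256+."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable byte: assign the next unused code point above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Recover the text a displayed token string stands for (hypothetical helper)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # errors="replace" marks tokens that hold only part of a multi-byte character.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ¼ãĥ³"))  # ーン
print(decode_token("çīĪ"))     # 版
```

Tokens such as "æ©" contain only the first bytes of a character, so they decode to the Unicode replacement character rather than to readable text.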
Positive Logits
...]         1.43
âĢ¦] (…])    1.23
?]           0.99
][           0.91
Pg           0.90
.]           0.88
].           0.87
][           0.86
]            0.86
!]           0.85
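Each list pairs a token with the feature's direct effect on that token's output logit. One common way to compute such logit weights is to take the dot product of the feature's decoder vector with each token's unembedding vector, then keep the k lowest and k highest values. This sketch makes that assumption. It ignores the final layer norm and may not match exactly how this dashboard produced its numbers. `logit_weights` and `top_and_bottom` are hypothetical names, and the vectors are toy values.

```python
def logit_weights(decoder_dir, unembed_rows):
    """Dot product of the feature's decoder direction with every token's unembedding row."""
    return [sum(d * w for d, w in zip(decoder_dir, row)) for row in unembed_rows]


def top_and_bottom(weights, vocab, k=10):
    """Return the k most negative and the k most positive (token, weight) pairs."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    negative = [(vocab[i], weights[i]) for i in order[:k]]
    # Slice with len(order) - k so that k == 0 yields an empty list.
    positive = [(vocab[i], weights[i]) for i in reversed(order[len(order) - k:])]
    return negative, positive


# Toy example: a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
unembed_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
decoder_dir = [1.0, 0.0]

weights = logit_weights(decoder_dir, unembed_rows)
negative, positive = top_and_bottom(weights, vocab, k=2)
print(negative)  # [('c', -1.0), ('b', 0.0)]
print(positive)  # [('d', 2.0), ('a', 1.0)]
```

With a real model the rows would come from the unembedding matrix and `k=10` would reproduce lists the length of the two above.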
Activations Density 0.021%
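Suppose "Activations Density" is the fraction of tokens in the evaluation data on which the feature is active, meaning its activation is above zero. Then 0.021% means the feature fires on roughly 1 token in 4,800. Below is a minimal sketch of that calculation. The function name, the zero threshold, and the toy data are assumptions.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: the feature fires on 21 of 100,000 tokens.
acts = [0.0] * 99_979 + [0.5] * 21
density = activation_density(acts)
print(f"{density:.3%}")  # 0.021%
```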