Explanations
punctuation marks, particularly periods and quotation marks indicating dialogue or emphasis
Negative Logits
„       -0.28
(“      -0.26
“       -0.24
“â̦     -0.21
=”      -0.20
“[      -0.20
``      -0.19
,“      -0.19
""".    -0.17
''.     -0.17
Positive Logits
()"↵    0.17
"↵↵     0.15
›       0.15
`       0.14
/fw     0.14
rodu    0.14
."↵↵    0.14
UTE     0.14
()"     0.13
bove    0.13
Activations Density: 0.278%