INDEX
Explanations
punctuation marks, specifically periods and exclamation points
Negative Logits
Token    Logit
â̦↵↵    -0.19
”    -0.18
...↵↵    -0.18
“    -0.18
--↵↵    -0.17
‘    -0.17
”.    -0.17
’n    -0.17
—↵↵    -0.17
-↵↵    -0.16
Positive Logits
Token    Logit
.↵    0.27
]↵    0.22
)↵    0.22
).↵    0.22
ा.↵    0.21
â̬↵    0.21
ãĢĤ↵    0.21
}↵    0.21
."↵    0.20
>.↵    0.20
Activations Density: 1.408%