INDEX
Explanations
punctuation marks, specifically quotation marks
Negative Logits
Token     Logit
“         -0.17
(“        -0.17
“[        -0.17
‘         -0.17
„V        -0.16
„M        -0.16
пример    -0.15    (Russian: "example")
„N        -0.15
“Oh       -0.14
„J        -0.14
Positive Logits
Token       Logit
said        0.48
said        0.35
says        0.33
explained   0.24
according   0.24
ÂĿ          0.24
say         0.23
says        0.23
сказал      0.23    (Russian: "said")
wrote       0.22
Activations Density 0.081%
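Token strings in dashboards like this one are sometimes shown in the tokenizer's byte-level form rather than as readable text, so that a token beginning with a low opening quote („) can appear as three unrelated-looking characters. The sketch below shows one way such strings might be decoded. It assumes the tokens come from a GPT-2-style byte-level BPE vocabulary, which the page itself does not state. The helper name `decode_token` is hypothetical. The byte table follows the same construction as `bytes_to_unicode` in GPT-2's released encoder. Characters outside that table are passed through unchanged, on the assumption that they were already decoded upstream.

```python
def bytes_to_unicode():
    """Map each byte 0..255 to a printable character, GPT-2 style."""
    # Bytes that already look printable map to themselves.
    bs = (list(range(0x21, 0x7F))      # '!' .. '~'
          + list(range(0xA1, 0xAD))    # '¡' .. '¬'
          + list(range(0xAE, 0x100)))  # '®' .. 'ÿ'
    cs = bs[:]
    n = 0
    # Remaining bytes (controls, space, 0x7F-0xA0, 0xAD) get code points 256+.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: display character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Best-effort decode of a (possibly partially decoded) byte-level token."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            # Character stands for a single raw byte.
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Assumption: character is already real text; keep it as UTF-8.
            raw.extend(ch.encode("utf-8"))
    # Invalid byte sequences become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["Ġsaid", "“Oh", "said"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Because a character such as `â` can be either a byte stand-in or genuine text, this decoding is heuristic and is best checked against the tokenizer's own vocabulary when that is available.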