Explanations
model response markers
instruction and formatting cues that dictate response structure, including length directives, explicit answer requests, and markers for code blocks or lists.
Negative Logits
Token       Value
vattum      0.35
sitten      0.34
intestino   0.34
terrasse    0.34
attaques    0.33
बरबाद       0.33
ওষুধের      0.32
ennemis     0.32
trafik      0.32
ét          0.31
Positive Logits
Token       Value
.           0.36
,           0.35
;           0.34
:           0.34
、          0.34
)           0.32
↵           0.32
For         0.31
'           0.31
-           0.28
Activations Density: 0.272%