Explanations
Detects when the assistant is producing a long, structured response. Activates on tokens that mark assistant-generated content: introductions, headings, and list or reply openers.
Negative Logits
the    0.94
n    0.70
the    0.69
a    0.69
or    0.68
de    0.66
c    0.65
<0x99>    0.58
to    0.57
<0x98>    0.57
Positive Logits
𝟬    0.66
are    0.66
()=>{    0.66
년간    0.65
৯    0.65
৮    0.64
ہیں۔    0.63
سيكون    0.63
৭    0.62
٠    0.61
Activations Density 6.869%
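
A minimal sketch of how a density figure like the one above could be computed, assuming it means the fraction of sampled tokens on which the feature fires (activation above zero). The function name, the threshold, and the sample values are hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of sampled tokens with a non-zero
    activation; the dashboard itself does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy activations for eight tokens (not data from this feature).
sample = [0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 6.869% would mean the feature is active on roughly 7 of every 100 sampled tokens.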