INDEX
    Explanations

    Detects when the model/assistant is producing a long, structured response. It activates on tokens that mark assistant-generated content: introductions, headings, and list or reply openers.

    NEGATIVE LOGITS
    "the"       0.94
    "n"         0.70
    " the"      0.69
    "a"         0.69
    "or"        0.68
    "de"        0.66
    "c"         0.65
    "<0x99>"    0.58
    "to"        0.57
    "<0x98>"    0.57
    POSITIVE LOGITS
    "𝟬"                      0.66
    " are"                   0.66
    "()=>{"                  0.66
    "년간"                   0.65
    (token not recovered)    0.65
    (token not recovered)    0.64
    " ہیں۔"                  0.63
    " سيكون"                 0.63
    (token not recovered)    0.62
    "٠"                      0.61
    Act Density 6.869%
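
    The density figure above is presumably the fraction of token positions on which the feature fires. The page does not state how it is computed, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a non-zero
    activation; the dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for eight tokens; two of them are non-zero.
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.2, 0.0]
print(f"{activation_density(sample):.3%}")  # prints 25.000%
```

    Under this reading, a density of 6.869% would mean the feature is active on roughly one token in fifteen of the sampled text.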

    No Known Activations