INDEX
    Explanations

    Assistant/model responses that provide structured explanations or evaluations, especially ones that note flaws or limitations or that follow task instructions.

    Negative Logits
    Token             Value
    " positive"       0.72
    " enrich"         0.69
    " enriching"      0.69
    " stär"           0.69
    "enrich"          0.68
    " exhilarating"   0.66
    " Enh"            0.66
    " favorable"      0.65
    " enhancing"      0.65
    "喜爱"            0.64
    Positive Logits
    Token             Value
    " useless"        1.72
    " failed"         1.68
    " ineffective"    1.64
    "failed"          1.59
    " incapable"      1.55
    " futile"         1.53
    " pointless"      1.53
    " Failed"         1.50
    " worthless"      1.49
    " unsuccessful"   1.48
    Activation Density: 3.409%

    No Known Activations