    Explanations

    This neuron detects user instructions asking the model to analyze, critique, evaluate, or review content.

    Negative Logits
    Token               Logit
    "Buffer"            -0.07
    "σετε"              -0.07
    "_buffer"           -0.07
    "ूर"                -0.07
    " relax"            -0.06
    " dine"             -0.06
    "oping"             -0.06
    "uvwxyz"            -0.06
    " frustrations"     -0.06
    "}">↵"              -0.06
    Positive Logits
    Token               Logit
    " الم"              0.07
    " действ"           0.07
    " seriously"        0.06
    " düş"              0.06
    " sont"             0.06
    " silicon"          0.06
    " vešker"           0.06
    "placements"        0.06
    " Savaşı"           0.06
    " legitimacy"       0.06
    Activation Density: 0.041%

    No Known Activations