INDEX
    Explanations

    This neuron detects meta‐instruction language, especially the word “role” and related role-play directives.

    Negative Logits
    " /\"             -0.07
    ".MainActivity"   -0.06
    " activations"    -0.06
    "Comput"          -0.06
    (blank)           -0.06
    " nick"           -0.06
    " deren"          -0.06
    "Pok"             -0.06
    " forgive"        -0.06
    "stin"            -0.06
    Positive Logits
    " Venezuelan"     0.07
    " Auckland"       0.07
    " Metals"         0.06
    "clearfix"        0.06
    "선거"            0.06
    " thực"           0.06
    " absolute"       0.06
    " categor"        0.06
    " MLA"            0.06
    "ısından"         0.06
    Activation Density: 0.001%

    No Known Activations