INDEX
    Explanations

    The neuron activates on policy-style permission and prohibition words (e.g. “allowed,” “prohibited”), that is, tokens that signal what is or is not permitted.

    Negative Logits
     фай
    -0.06
    international
    -0.06
    orama
    -0.06
     Readers
    -0.06
    ROY
    -0.06
     mil
    -0.06
                                                                                 
    -0.06
    分钟
    -0.06
     WaitForSeconds
    -0.06
     جهانی
    -0.06
    Positive Logits
    0.07
    icontains
    0.07
     zaměstn
    0.07
    umb
    0.07
    elines
    0.06
    emap
    0.06
    .splice
    0.06
     sufficient
    0.06
     amplify
    0.06
    .flatMap
    0.06
    Activation Density: 0.006%

    No Known Activations