INDEX
    Explanations

    disclaimers

    This neuron activates on tokens that mark a disclaimer or editorial aside—e.g. “Disclaimer:”, “PS:”, “Edit:”—flagging in-text notes or warning labels.

    sentences expressing discontent with or criticism of social behavior and interactions.

    Negative Logits
    " terrorist"    -0.07
    " rovněž"       -0.07
    ".Help"         -0.06
    " Rico"         -0.06
    " roses"        -0.06
    "ipment"        -0.06
    " Theater"      -0.06
    " Tweet"        -0.06
    " butterfly"    -0.06
    " descr"        -0.06
    Positive Logits
    " appell"             0.07
    "UBLE"                0.07
    "_BIN"                0.07
    " чис"                0.07
    "ouble"               0.07
    " $('"                0.06
    "accuracy"            0.06
    "\tloc"               0.06
    "월까지"              0.06
    " interchangeable"    0.06
    Activation Density: 0.027%

    No Known Activations