INDEX
    Explanations

    The neuron fires on content-moderation warning text, particularly on fragments of the word “moderation” in the message “YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES”.

    Negative Logits
    Token            Logit
    "uste"           -0.07
    "xaf"            -0.06
    "ến"             -0.06
    "らし"            -0.06
    " ultimately"    -0.06
    ".shapes"        -0.06
    " Leah"          -0.06
    "らい"            -0.06
    " trabaj"        -0.06
    "meyi"           -0.06
    Positive Logits
    Token            Logit
    (not shown)      0.07
    "($("            0.07
    " sectarian"     0.06
    ",'"             0.06
    " witnessed"     0.06
    " vowed"         0.06
    " demo"          0.06
    " Fucking"       0.06
    "\tcase"         0.06
    ".ImageIcon"     0.06
    Act Density 0.001%
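
To give a sense of scale for the density figure above, here is a minimal sketch. It assumes that activation density means the fraction of sampled tokens on which the neuron's activation exceeds zero; the dashboard does not state its exact definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the source does not say how density is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron is active on 1 token out of 100,000.
sample = [0.0] * 99_999 + [0.8]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.001%
```

Under this assumed definition, a density of 0.001% corresponds to roughly one active token per 100,000 sampled.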

    No Known Activations