INDEX
    Explanations

    disclaimers/rules

    The neuron chiefly responds to punctuation tokens (commas and periods), especially in the assistant’s refusal/apology phrasing.

    responses that promote respect and non-discrimination towards individuals and groups.

    Negative Logits
    Token               Logit
    "латы"              -0.07
    " vej"              -0.06
    " pastry"           -0.06
    "ников"             -0.06
    " redistributed"    -0.06
    "owan"              -0.06
    " Gad"              -0.06
    "idders"            -0.06
    (blank)             -0.06
    "اهش"               -0.06
    Positive Logits
    Token               Logit
    ":↵"                0.07
    "年の"              0.07
    "---------↵"        0.07
    "Large"             0.07
    " gerekmektedir"    0.06
    " [];↵"             0.06
    "_cou"              0.06
    ".ToBoolean"        0.06
    ".*↵"               0.06
    ".auth"             0.06
    Activation Density: 0.036%

    No Known Activations