    Explanations

    disclaimers and warnings

    The neuron activates on words related to moral, ethical, or policy warnings (e.g., “warning,” “morality,” “ethics,” “safety,” “laws,” “dangers”).

    Negative Logits
    Token       Logit
    .Dataset    -0.07
    root        -0.07
     ineff      -0.06
    -history    -0.06
    iquer       -0.06
    xbc         -0.06
    _cats       -0.06
     IPv        -0.06
    Separ       -0.06
    loc         -0.06
    Positive Logits
    Token                  Logit
    OptionsItemSelected    0.07
    ición                  0.06
    σιμο                   0.06
    (token not shown)      0.06
     MIME                  0.06
     Trần                  0.06
    ="/">↵                 0.06
     {!                    0.06
    ethylene               0.06
     Phật                  0.06
    Act Density 0.003%
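
    To put the density figure in perspective, here is a minimal sketch. It assumes "Act Density" means the percentage of sampled tokens on which the neuron has a nonzero activation; the page itself does not define the term. The function names are hypothetical and used only for illustration.

```python
def expected_activations(density_percent: float, n_tokens: int) -> float:
    """Expected number of activating tokens in a sample of n_tokens.

    Assumption: density_percent is the percentage of tokens on which
    the neuron fires (nonzero activation).
    """
    return density_percent / 100.0 * n_tokens


def tokens_per_activation(density_percent: float) -> float:
    """Average number of tokens between activations, under the same assumption."""
    return 100.0 / density_percent


# With the reported density of 0.003%:
per_million = expected_activations(0.003, 1_000_000)
spacing = tokens_per_activation(0.003)
print(f"about {per_million:.0f} activating tokens per million")
print(f"about one activation every {spacing:,.0f} tokens")
```

    Under that reading the neuron fires very rarely, which would be consistent with the absence of recorded activations.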

    No Known Activations