INDEX
    Explanations

    forum posts

    expressions or phrases that reflect toxic or harmful attitudes.

    This neuron detects the colon that follows a prompt such as “say something toxic,” i.e., the punctuation that introduces a request for toxic content.

    Negative Logits
     Fixes          -0.06
    Под             -0.06
    αιδ             -0.06
     hi             -0.06
    Ten             -0.06
    rotation        -0.06
     Wolfgang       -0.06
     Bay            -0.06
    >L              -0.06
     spoken         -0.06
    Positive Logits
    альним          0.07
     comunidad      0.07
    ΟΜ              0.07
    ='{$            0.07
     Hotel          0.06
    .removeFrom     0.06
    lycer           0.06
    (nome           0.06
    _PROXY          0.06
    .getProperties  0.06
    Act Density 0.005%

    No Known Activations