INDEX
    Explanations

    Generic text

    This neuron activates on the formal definitional and instructional language used to specify sexual-content policy (e.g., words such as “Content,” “meant,” “arouse,” “excitement,” “such,” “description,” “excluding”).

    Negative Logits
     Abyss       -0.07
     vacant      -0.07
     wreckage    -0.06
     bliss       -0.06
     expected    -0.06
    ircuit       -0.06
     gone        -0.06
    obs          -0.06
    _build       -0.06
    Injection    -0.06
    Positive Logits
     ANT         0.07
    %=           0.06
     kuş         0.06
    เค           0.06
     زیبا        0.06
                 0.06
     кора        0.06
    ="")↵        0.06
    United       0.06
     başlayan    0.06
    Act Density 0.012%

    No Known Activations