INDEX
    Explanations

    Instructions, opinions

    The neuron flags instructional language that guides users to carry out unsafe, unethical, or otherwise disallowed actions.

    Any mention of dangerous or harmful activities, and of issues surrounding consent and ethics.

    Negative Logits
    .URL          -0.07
    top           -0.07
    Frequency     -0.07
     synergy      -0.06
     Houses       -0.06
     Sections     -0.06
     benchmark    -0.06
    Orders        -0.06
    fish          -0.06
     Velocity     -0.06
    Positive Logits
     dern         0.08
    amız          0.07
    ']?>"         0.06
     اما          0.06
     trough       0.06
                  0.06
     siyas        0.06
     aktif        0.06
    imiz          0.06
     ขนาด         0.06
    Act Density 0.041%

    No Known Activations