INDEX
    Explanations

    The neuron detects language about disallowed or harmful behavior (e.g. “harmful,” “toxic,” “violence,” “behavior,” “promote,” “encourage”).

    Negative Logits
    Token            Logit
    UTTON            -0.07
     sağlık          -0.06
     AssemblyTitle   -0.06
    obi              -0.06
    SP               -0.06
    (""));↵          -0.06
    registration     -0.06
    Unary            -0.05
     polic           -0.05
    NAS              -0.05
    Positive Logits
    Token            Logit
    pom              0.07
     chóng           0.07
     wrestling       0.07
     Sampling        0.07
    rgba             0.07
     Difficulty      0.06
     DH              0.06
     past            0.06
     ell             0.06
    Stage            0.06
    Activation Density: 0.022%
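
    The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of sampled token positions on which the neuron's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means "share of sampled tokens on which the neuron fires".
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 token positions, 2 of which fire.
sample = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.020%
```

    Under this reading, a density of 0.022% would mean the neuron fires on roughly 2 of every 10,000 sampled tokens, which fits a sparse feature.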

    No Known Activations