INDEX
    Explanations

    This neuron primarily detects mentions of self-harm and related behaviors.

    Negative Logits
    Token            Logit
    "Radio"          -0.07
    " Oscars"        -0.06
    " Diego"         -0.06
    " Nodes"         -0.06
    "modity"         -0.06
    "лина"           -0.06
    "(크기"           -0.06
    " Geschichte"    -0.06
    "ニメ"            -0.06
    "یلی"            -0.06
    Positive Logits
    Token              Logit
    " gentle"          0.07
                       0.07
    "_BACK"            0.07
    ".den"             0.06
    "”↵↵"              0.06
    ".ADMIN"           0.06
    " way"             0.06
    ".':"              0.06
    "DEF"              0.06
    "/contentassist"   0.06
    Activation Density: 0.004%
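
    As a rough illustration of what a density figure like the one above could mean, the sketch below assumes it is the fraction of sampled token positions on which the neuron's activation exceeds a threshold. That definition, the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumed definition for illustration only; the dashboard's exact
    computation is not stated on this page.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 4 active positions out of 100,000 sampled tokens.
sample = [0.0] * 99_996 + [1.3, 0.2, 2.1, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.004%
```

    Under this assumed definition, a density of 0.004% corresponds to about 4 active positions per 100,000 tokens, so a small sample could easily contain no activating examples.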

    No Known Activations