INDEX
    Explanations

    The neuron fires on keywords and phrases that name or introduce unsafe-content categories (e.g. “sexual…arouse,” “promotes,” “depicts,” “incites,” “self-harm,” “violence”), effectively marking tokens that specify types of policy violation.

    Negative Logits
    άνα
    -0.07
    ackages
    -0.07
     ITS
    -0.06
    	items
    -0.06
    .buffer
    -0.06
    ERT
    -0.06
    Server
    -0.06
     dimensions
    -0.06
    monitor
    -0.06
    isateur
    -0.06
    Positive Logits
    ,一
    0.08
    ebi
    0.07
    rtype
    0.07
     starttime
    0.07
    dığını
    0.06
     Tub
    0.06
     eBook
    0.06
    adolu
    0.06
     Hoover
    0.06
     quarterbacks
    0.06
    Act Density 0.016%

    No Known Activations