    Explanations

    This neuron primarily detects mentions of content restrictions: words such as “censorship” and “filtering,” along with related moderation terms.

    Negative Logits
    "Buscar"            -0.06
    "VIP"               -0.06
    " fract"            -0.06
    "Cs"                -0.06
    " Proxy"            -0.06
    (token not shown)   -0.06
    " playlists"        -0.06
    (token not shown)   -0.06
    " Ils"              -0.05
    "(Mock"             -0.05
    Positive Logits
    (token not shown)   0.07
    "=path"             0.07
    " POSSIBILITY"      0.06
    "oined"             0.06
    " wonderfully"      0.06
    " Sofa"             0.06
    "_problem"          0.06
    "$template"         0.06
    " RTVF"             0.06
    "нитель"            0.06
    Activation Density: 0.003%

    No Known Activations