    Explanations

    The neuron fires on words that frame or qualify disallowed requests as serving “educational,” “scientific,” “research,” or “professional” purposes.

    Negative Logits
    Token           Logit
    " Indicates"    -0.08
    "asto"          -0.07
    " pys"          -0.07
    " Www"          -0.06
    "affles"        -0.06
    "_office"       -0.06
    " Soccer"       -0.06
    " François"     -0.06
    " Inches"       -0.06
    " press"        -0.06

    Positive Logits
    Token           Logit
    "online"        0.07
    "'',"           0.07
    " будь"         0.06
    "Startup"       0.06
    " regular"      0.06
    "_strategy"     0.06
    ")↵↵↵"          0.06
    " titulo"       0.06
    "_div"          0.06
    " adı"          0.06

    Activation Density: 0.020%

    No Known Activations