    Explanations

    The neuron activates on phrases referring to “moral policing” or related content‐policy warnings.

    Negative Logits
    " kapsamında"     -0.07
    " κου"            -0.07
    " Summer"         -0.07
    " бюдж"           -0.06
    " Rut"            -0.06
    " Nights"         -0.06
    " burning"        -0.06
    " uncertainty"    -0.06
    " conspic"        -0.06
    " желуд"          -0.06
    Positive Logits
    " policing"       0.09
    " responsibly"    0.09
    "grass"           0.07
    [token missing]   0.06
    " فارس"           0.06
    " ViewChild"      0.06
    "ragments"        0.06
    "resco"           0.06
    " babys"          0.06
    " داد"            0.06
    Activation Density: 0.001%

    No Known Activations