INDEX
    Explanations

    The neuron activates on single-word race labels (such as “black” or “white”), detecting mentions of a person’s race.

    Negative Logits
    Token             Logit
    " shiny"          -0.07
    " refr"           -0.07
    "修改"            -0.07
    " exponential"    -0.07
    (blank)           -0.07
    " пл"             -0.06
    " reconnect"      -0.06
    " ragazze"        -0.06
    "َع"              -0.06
    (blank)           -0.06

    Positive Logits
    Token             Logit
    " coupon"         0.07
    "WSC"             0.06
    "locker"          0.06
    ".stat"           0.06
    " tag"            0.06
    "=self"           0.06
    "dım"             0.06
    "timeline"        0.06
    "-player"         0.06
    " полит"          0.06

    Activation Density: 0.024%

    No Known Activations