INDEX
    Explanations

    derogatory terms and slurs related to women and minorities

    Negative Logits
    "Życiorys"       -0.56
    " Silas"         -0.52
    " Surya"         -0.49
    " Lumi"          -0.49
    "dolu"           -0.48
    "<unused51>"     -0.47
    "<pad>"          -0.47
    "<unused28>"     -0.47
    "<unused14>"     -0.47
    "<unused52>"     -0.47
    Positive Logits
    " bitch"           1.40
    " Bitch"           1.35
    "Bitch"            1.34
    "bitch"            1.20
    " bitches"         1.02
    " slut"            0.52
    " createState"     0.50
    " avoient"         0.50
    " motherfucker"    0.49
    "PMailer"          0.49
    Activation Density: 0.010%

    No Known Activations