INDEX
    Explanations

    discriminatory language related to race, gender, sexual orientation, and religion

    instances of discriminatory language or terms related to prejudice and bigotry

    Negative Logits
    hower      -0.87
    leaf       -0.82
    pletion    -0.79
    flix       -0.77
    change     -0.76
    pring      -0.75
    ources     -0.75
    imum       -0.75
    ership     -0.73
    aper       -0.72
    Positive Logits
     slurs          1.48
     stereotypes    1.02
     prejudice      1.00
     jokes          0.99
     homophobic     0.97
     tir            0.96
     slur           0.96
     bigot          0.95
     sexist         0.94
     prejud         0.94
    Activation Density 0.077%

    No Known Activations