    Explanations

    phrases related to discrimination and prejudice, particularly focusing on sexism

    terms related to sexism and misogyny

    Negative Logits
    Trust       -0.76
    leaf        -0.73
    Package     -0.71
    hyde        -0.71
    mental      -0.69
    ernels      -0.68
    uilding     -0.68
    VIS         -0.68
    ving        -0.67
    NAS         -0.67
    Positive Logits
     sexist       1.03
     misogyn      0.89
     slurs        0.87
     jokes        0.81
     stereotypes  0.79
     stereotyp    0.78
     banter       0.76
     Equality     0.74
     feminists    0.73
     sexism       0.73
    Act Density 0.018%

    No Known Activations