    Explanations

    negative stereotypes and how they reinforce harmful narratives

    Negative Logits
    "gur"          -0.86
    "sterdam"      -0.85
    "ilated"       -0.76
    "Aid"          -0.75
    " Journals"    -0.75
    "ayan"         -0.75
    "cel"          -0.73
    "imentary"     -0.73
    "ates"         -0.72
    "keeping"      -0.72
    Positive Logits
    "è¦ļéĨĴ"        1.05
    "pmwiki"        1.04
    " stereotyp"    0.99
    " trope"        0.92
    " tropes"       0.89
    " clich"        0.87
    "enegger"       0.86
    "ALLY"          0.80
    "rities"        0.80
    " stereotypes"  0.78
    Activation Density: 6.969%

    No Known Activations