INDEX
    Explanations

    references to gender, specifically male and female

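    Tokens are shown in single quotes; a leading space inside the quotes is part of the token.

    Negative Logits
    Token        Logit
    ' Notion'    -0.77
    'ap'         -0.70
    ' ?>">'      -0.70
    ' Conquer'   -0.65
    ' "]'        -0.65
    ' }}"></'    -0.64
    'ed'         -0.64
    ' YAP'       -0.64
    'man'        -0.64
    'dill'       -0.63

    Positive Logits
    Token        Logit
    ' male'      1.91
    ' Male'      1.86
    ' MALE'      1.80
    ' female'    1.78
    ' FEMALE'    1.75
    ' Female'    1.71
    'Male'       1.71
    'MALE'       1.69
    'Female'     1.65
    'female'     1.62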
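    Logit lists like the ones above are commonly produced by projecting a feature's decoder direction onto the model's unembedding matrix and ranking the vocabulary by the result. The sketch below shows that idea under stated assumptions: `decoder_dir`, `W_U`, and the helper `top_logit_tokens` are hypothetical names, the toy vocabulary and numbers are made up, and the actual dashboard may apply scaling or normalization not shown here.

```python
import numpy as np


def top_logit_tokens(decoder_dir, W_U, vocab, k=3):
    """Rank tokens by the direct logit effect of one feature direction.

    Assumptions (hypothetical, for illustration only):
      decoder_dir: shape (d_model,), the feature's decoder vector.
      W_U: shape (d_model, vocab_size), the model's unembedding matrix.
      vocab: list of token strings of length vocab_size.
    """
    scores = decoder_dir @ W_U          # one logit contribution per token
    order = np.argsort(scores)          # token indices, ascending by score
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return positive, negative


# Toy example with a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["male", "female", "ap", "ed"]
W_U = np.array([[2.0, 1.5, -0.7, -0.6],
                [0.0, 0.0, 1.0, 0.0]])
decoder_dir = np.array([1.0, 0.0])
pos, neg = top_logit_tokens(decoder_dir, W_U, vocab, k=2)
print(pos)  # highest-scoring tokens first
print(neg)  # lowest-scoring tokens first
```

    Sorting once with `argsort` yields both ends of the ranking, which is why the dashboard can show the most negative and most positive tokens side by side.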
    Activation Density: 0.087%
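    Activation density is usually read as the share of token positions on which the feature fires. A minimal sketch, assuming the feature's activations are available as a non-negative array (as after a ReLU) and that "fires" means an activation strictly above zero; the helper name `activation_density` and the synthetic data are hypothetical:

```python
import numpy as np


def activation_density(acts):
    """Fraction of token positions where the feature activation is > 0.

    Assumes `acts` holds one non-negative activation per token position.
    """
    acts = np.asarray(acts)
    return float(np.count_nonzero(acts > 0) / acts.size)


# Synthetic example: 87 active positions out of 100,000 tokens.
acts = np.zeros(100_000)
acts[:87] = 1.0
density = activation_density(acts)
print(f"{density:.3%}")
```

    With these synthetic numbers the density is 0.00087, i.e. 0.087%, the same order of sparsity as the value reported for this feature.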

    No Known Activations