INDEX
    Explanations

    references to gender, particularly related to men

    Negative Logits
    Token     Logit
    er        -0.20
    tica      -0.19
    eriod     -0.18
    engin     -0.18
    erif      -0.17
    erin      -0.16
    dete      -0.16
    erde      -0.16
    ted       -0.16
    agini     -0.15

    Positive Logits
    Token     Logit
    folk      0.49
    opause    0.46
    aces      0.38
    ial       0.38
    ager      0.35
    aced      0.34
    ials      0.33
    /w        0.33
    's        0.31
    fol       0.30

    Activation Density: 0.034%

    No Known Activations