INDEX
    Explanations

    references to historical or fictional female heroes

    Negative Logits
    ners      -0.18
    peng      -0.16
    olas      -0.16
    gang      -0.15
    ster      -0.15
    ty        -0.15
    mate      -0.14
    ety       -0.14
    ings      -0.14
    ม         -0.14
    Positive Logits
    ines      0.24
    ically    0.22
    ics       0.20
    anova     0.18
    ine       0.18
    ism       0.17
    itics     0.16
    ic        0.16
    897       0.16
    MES       0.16
    Activation Density 0.015%

    No Known Activations