INDEX
    Explanations

    pronouns and their various forms, particularly in the context of male subjects

    Negative Logits
    vyk             -0.17
    cido            -0.15
    ray             -0.15
    TestCategory    -0.15
     sm             -0.15
    oct             -0.15
    504             -0.15
    Metro           -0.15
    crest           -0.14
    vents           -0.14
    Positive Logits
    ster            0.24
    inner           0.24
    wor             0.23
    stm             0.21
    inn             0.21
    öff             0.20
    kan             0.20
    he              0.19
    hi              0.18
    mut             0.18
    Act Density 0.006%

    No Known Activations