INDEX
    Explanations

    references to gender inequalities and societal expectations

    Negative Logits
    "icari"   -0.10
    "arget"   -0.09
    "indir"   -0.08
    "รม"      -0.08
    "aris"    -0.08
    "allah"   -0.08
    "もり"    -0.08
    "aç"      -0.08
    "onse"    -0.08
    "anzi"    -0.07
    Positive Logits
    " male"        0.24
    " males"       0.20
    " Male"        0.18
    "male"         0.18
    "男性"         0.16
    " masculine"   0.16
    "Male"         0.15
    " men"         0.14
    " мужчин"      0.14
    " mascul"      0.14
    Act Density 0.032%
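
As a rough illustration of how a density figure like this could be read, the sketch below assumes activation density means the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration; they are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumed definition, for illustration only: a position counts as
    "active" when its activation is strictly greater than the threshold.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 32 of them active,
# chosen so the result lines up with the 0.032% shown above.
acts = [0.0] * 100_000
for i in range(0, 100_000, 3_125):  # 32 evenly spaced positions
    acts[i] = 1.7

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.032%
```

Under this reading, a density of 0.032% would mean the feature fires on roughly 3 out of every 10,000 sampled tokens.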

    No Known Activations