INDEX
    Explanations

    mentions of individuals, particularly women, within the context of their actions, roles, and experiences

    Negative Logits
     himself
    -0.36
     Himself
    -0.24
    妻
    -0.24
     stesso
    -0.22
    his
    -0.21
    俊
    -0.20
     نفسه
    -0.19
    /she
    -0.19
     Jr
    -0.19
     handsome
    -0.19
    Positive Logits
     herself
    0.57
     сама
    0.28
     могла
    0.26
    athed
    0.24
    ová
    0.24
    丈夫
    0.24
     должна
    0.24
     стала
    0.23
    /he
    0.22
     сказала
    0.22
    Act Density 3.095%

    No Known Activations