Explanations
mentions of individuals, particularly women, within the context of their actions, roles, and experiences
Negative Logits
himself
-0.36
Himself
-0.24
妻
-0.24
stesso
-0.22
his
-0.21
俊
-0.20
نفسه
-0.19
/she
-0.19
Jr
-0.19
handsome
-0.19
Positive Logits
herself
0.57
сама
0.28
могла
0.26
athed
0.24
ová
0.24
丈夫
0.24
должна
0.24
стала
0.23
/he
0.22
сказала
0.22
Activations Density 3.095%