Explanations
references to female characters and their emotions or actions
Negative Logits
妻            -0.27
wife          -0.26
Wife          -0.22
/she          -0.21
wife          -0.21
himself       -0.20
ÙĨÙ쨳Ùĩ     -0.19
sing          -0.17
seul          -0.16
ship          -0.16
Positive Logits
/he           0.39
herself       0.39
athed         0.36
pher          0.35
esh           0.33
pherd         0.30
ikh           0.30
ffield        0.27
athing        0.27
pard          0.26
Activations Density 0.147%