Explanations
names, particularly those related to familial relationships
Negative Logits
jr        -0.17
halb      -0.16
íͽ        -0.15
isbury    -0.15
JR        -0.14
зависим   -0.14
Jr        -0.14
koji      -0.14
←         -0.14
uncle     -0.14
Positive Logits
n         0.26
ne        0.23
herself   0.20
geb       0.17
Augusta   0.17
Adelaide  0.16
rica      0.16
etta      0.16
Theresa   0.16
$n        0.15
Activations Density 0.109%