INDEX
Explanations
pronouns and possessive forms related to individuals and communities
subjects/objects after certain pronouns/articles
Negative Logits
Token     Logit
and       -0.43
,         -0.39
1         -0.36
;         -0.35
2         -0.35
both      -0.34
3         -0.31
-         -0.30
either    -0.30
y         -0.30
Positive Logits
Token         Logit
ſind          1.02
Verſ          1.02
verſ          0.95
myſelf        0.94
queſta        0.91
majánló       0.90
ſſung         0.90
ſta           0.90
<unused3>     0.88
<unused68>    0.88
Activations Density 0.141%
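
The density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below shows one way such a number might be computed. It assumes that "fires" means an activation strictly greater than zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical illustrations, not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 141 active positions out of 100,000 sampled tokens.
sample = [1.0] * 141 + [0.0] * (100_000 - 141)
print(f"{activation_density(sample):.3f}%")  # prints "0.141%"
```

A density this low would mean the feature is sparse: under the assumption above, it is active on roughly one token in 700.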