INDEX
Explanations
references to female and male characters in the text
pronouns describing people
Negative Logits
RegressionTest   -0.44
propOrder        -0.41
気がする             -0.40
})()             -0.39
kwds             -0.38
confusion        -0.37
Alford           -0.37
falsche          -0.37
IntoConstraints  -0.37
logging          -0.36
Positive Logits
ftagPool      0.52
новништво     0.52
virkelig      0.51
Geplaatst     0.49
fjspx         0.47
adpleegd      0.47
Билгалдахарш  0.47
zydent        0.47
astore        0.46
nahilalakip   0.43
Activations Density: 0.045%