INDEX
Explanations
pronominal references, especially gendered pronouns and their antecedents
Negative Logits
hom          -0.43
arşivlendi   -0.42
num          -0.41
y            -0.41
きれい       -0.41
ev           -0.39
ex           -0.39
haya         -0.39
paci         -0.39
Sco          -0.38
Positive Logits
#+#             0.81
>=",            0.80
Signalez        0.78
beginnetje      0.77
++];            0.76
tvguidetime     0.73
Paglinawan      0.72
RTDA            0.72
invokingState   0.71
%)$             0.71
Activations Density 0.267%