Explanations
pronouns referring to people and their actions
Negative Logits
aig       -0.61
inva      -0.60
ssz       -0.58
mín       -0.56
accesso   -0.56
incu      -0.56
olas      -0.55
pomo      -0.55
pary      -0.55
を取る    -0.54
Positive Logits
he        1.50
He        1.36
she       1.33
He        1.31
she       1.25
himself   1.24
She       1.23
himself   1.20
THEY      1.18
She       1.17
Activations Density 0.181%