INDEX
Explanations
references to male characters and pronouns
Negative Logits
Token       Logit
ssz         -0.61
setShow     -0.60
inva        -0.59
aig         -0.59
similar     -0.56
Noch        -0.56
comod       -0.54
fora        -0.53
ancia       -0.53
dima        -0.53
Positive Logits
Token         Logit
himself       1.42
he            1.35
himself       1.30
He            1.24
she           1.22
He            1.22
tvguidetime   1.20
Himself       1.17
OGND          1.15
she           1.14
Activations Density: 0.285%