Explanations
references to male characters or subjects, particularly using third-person pronouns
Negative Logits
Token      Logit
ssz        -0.67
Noch       -0.60
aig        -0.59
inva       -0.59
acd        -0.56
comod      -0.55
setShow    -0.54
URS        -0.53
entro      -0.51
similar    -0.51
Positive Logits
Token      Logit
himself    1.38
he         1.35
himself    1.24
He         1.22
she        1.21
He         1.20
She        1.12
she        1.12
hehe       1.11
Himself    1.10
Activations Density 0.348%
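
The density figure can be read as the share of token positions on which the feature fires. The sketch below is a minimal illustration under that assumption; the page does not state how the value is computed. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active (activation strictly greater than the threshold).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 3 of which are active.
sample = [0.0] * 997 + [0.8, 1.3, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.348% would correspond to the feature being active on roughly 35 of every 10,000 token positions.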