Explanations
mentions of genders and relationships between boys and girls
Negative Logits
Token    Logit
aker     -0.19
wich     -0.18
emen     -0.17
ITTER    -0.16
ermen    -0.16
estre    -0.16
erman    -0.15
itter    -0.15
.psi     -0.15
Niet     -0.14
Positive Logits
Token    Logit
friend   0.22
Scout    0.22
Scouts   0.21
scout    0.21
hood     0.20
Friend   0.19
cott     0.19
friends  0.19
scouts   0.18
freund   0.17
Activations Density: 0.022%