INDEX
Explanations
references to the term "boy" and related gendered terms
NEGATIVE LOGITS
wich     -0.18
.psi     -0.16
aker     -0.16
oller    -0.16
agrid    -0.15
ITTER    -0.15
obox     -0.15
nop      -0.15
ial      -0.14
ela      -0.14
POSITIVE LOGITS
friend   0.25
Scout    0.22
Scouts   0.21
friends  0.20
Friend   0.20
hood     0.19
riend    0.19
Wonder   0.19
band     0.19
arin     0.18
Activations Density: 0.014%