INDEX
Explanations
references to race and gender dynamics
Negative Logits
gend     -0.18
zt       -0.17
esi      -0.15
gens     -0.15
ocks     -0.15
athan    -0.15
Mig      -0.14
rem      -0.14
chez     -0.14
leh      -0.14
Positive Logits
readcr         0.15
Äįer           0.15
ạp             0.15
.dex           0.15
åĢĴ            0.15
exclusive      0.15
/Foundation    0.15
monds          0.14
-Clause        0.14
ean            0.14
Activations Density 0.206%