INDEX
Explanations
references to gender-related biases and expectations
Negative Logits
opup        -0.19
osh         -0.15
urge        -0.15
째          -0.14
udu         -0.14
ator        -0.14
ocese       -0.14
angl        -0.14
ogan        -0.13
ÑĢаÐ        -0.13
Positive Logits
peg         0.15
304         0.15
enia        0.15
Interrupt   0.15
Pruitt      0.14
oke         0.14
iyon        0.14
átek        0.14
quia        0.13
§           0.13
Activations Density 0.091%