INDEX
Explanations
words related to work, pay, and possibly gender discrimination
Negative Logits
enance       -0.74
luence       -0.68
ablished     -0.63
equality     -0.62
resents      -0.61
CONCLUS      -0.60
ablish       -0.60
Failure      -0.59
edience      -0.58
ifference    -0.58
Positive Logits
!).          1.52
?).          1.47
!),          1.41
!)           1.34
?),          1.33
).           1.25
*)           1.24
>)           1.23
?)           1.21
)).          1.19
Activations Density 0.615%