INDEX
Explanations
instances of rejection or refusal in decision-making contexts
Negative Logits
.Annotations  -0.16
ERRU          -0.15
clare         -0.15
ubo           -0.15
arkan         -0.15
loub          -0.14
asje          -0.14
ctl           -0.14
;br           -0.14
ENCIL         -0.14
Positive Logits
offers        0.24
Reject        0.23
offer         0.23
Reject        0.23
reject        0.23
rejected      0.23
rejection     0.23
reject        0.22
offered       0.22
declined      0.21
Activations Density: 0.125%