INDEX
Explanations
phrases related to fairness and honesty
Negative Logits
Token      Logit
ELLOW      -0.15
SSERT      -0.15
uem        -0.14
astos      -0.14
ego        -0.14
ecess      -0.14
esinden    -0.14
ÏĦÎŃ       -0.14
rtl        -0.14
crim       -0.13
Positive Logits
Token      Logit
fairness   0.22
fair       0.18
fair       0.17
oppel      0.17
Fair       0.17
Fortune    0.16
credit     0.16
itzer      0.15
Paren      0.14
Fair       0.14
Activations Density: 0.070%
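
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and chosen only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 7 firing positions out of 10,000 sampled tokens.
toy = [0.0] * 9993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.070%
```

Under this reading, a density of 0.070% means the feature is active on roughly 7 of every 10,000 tokens, which fits a narrow, topic-specific feature.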