INDEX
Explanations
terms and concepts related to discrimination and bias
Negative Logits
Token      Logit
lify       -0.16
liers      -0.15
aho        -0.15
osphere    -0.14
appa       -0.14
ifier      -0.14
comings    -0.14
íĮĮ        -0.14
rades      -0.14
íĴĪ        -0.14
Positive Logits
Token        Logit
against      0.28
toward       0.27
towards      0.26
based        0.25
Against      0.24
against      0.23
Against      0.23
experienced  0.22
Towards      0.22
Based        0.20
Activations Density: 0.040%