INDEX
Explanations
references to fairness in various contexts
Negative Logits
naire      -0.17
egasus     -0.16
esc        -0.16
ectomy     -0.15
tingham    -0.15
chef       -0.14
ÏĢο        -0.14
ential     -0.14
editor     -0.14
chin       -0.14
Positive Logits
grounds    0.18
asan       0.17
ground     0.17
aisal      0.16
aban       0.16
weather    0.16
ably       0.16
mount      0.15
yt         0.15
ness       0.15
Activations Density: 0.025%