INDEX
Explanations
references to fairness and fairness-related concepts
Negative Logits
elho     -0.20
omic     -0.17
CHIP     -0.16
ering    -0.15
ova      -0.15
endor    -0.15
owo      -0.14
ote      -0.14
chin     -0.14
chers    -0.14
Positive Logits
yt       0.31
ground   0.27
weather  0.26
grounds  0.26
fax      0.24
er       0.22
hart     0.19
iez      0.18
mount    0.17
bnb      0.17
Activations Density: 0.026%