Explanations
references to issues of race, gender, and social equity
Negative Logits
amba          -0.15
éĩ            -0.14
rey           -0.14
ActionTypes   -0.14
_usec         -0.13
evin          -0.13
ourke         -0.13
underrated    -0.13
inan          -0.13
sag           -0.13
Positive Logits
white          0.34
white          0.29
-white         0.29
çϽ             0.28
White          0.26
WHITE          0.26
_white         0.26
çϽ             0.26
White          0.25
whites         0.25
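Logit lists like these are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and ranking the per-token results. The dashboard does not state its exact method, so the sketch below assumes that projection. The helper names (`logit_effects`, `top_and_bottom`) and the toy weights are hypothetical illustrations, not the real model's parameters.

```python
from typing import List, Sequence, Tuple


def logit_effects(direction: Sequence[float], unembed: Sequence[Sequence[float]]) -> List[float]:
    """Dot the feature's decoder direction with each vocabulary column of W_U.

    Assumption: `unembed` is row-major with d_model rows of vocab_size entries.
    """
    vocab_size = len(unembed[0])
    return [
        sum(direction[i] * unembed[i][t] for i in range(len(direction)))
        for t in range(vocab_size)
    ]


def top_and_bottom(
    effects: Sequence[float], vocab: Sequence[str], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, effect) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # ascending: most negative first
    positive = ranked[::-1][:k]    # descending: most positive first
    return positive, negative


if __name__ == "__main__":
    # Toy 2-dimensional model with a 3-token vocabulary (hypothetical numbers).
    direction = [1.0, 0.5]
    unembed = [[0.25, -0.25, 0.5], [0.5, 0.0, -0.5]]
    effects = logit_effects(direction, unembed)
    print(top_and_bottom(effects, ["white", "amba", "sag"], k=1))
```

With the toy values, "white" receives the largest boost (0.5) and "amba" the largest suppression (-0.25), mirroring the shape of the lists in this dashboard.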
Activation density: 0.166%
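Activation density is usually read as the share of token positions on which the feature fires; 0.166% corresponds to roughly one token in 600. The sketch below assumes "fires" means an activation strictly above a threshold of 0 (some dashboards use a different cutoff), and `activation_density` is a hypothetical helper name.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`.

    Assumption: a feature counts as active only when its value is strictly above
    the threshold (default 0.0).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # One active position out of four gives a density of 25%.
    print(activation_density([0.0, 1.2, 0.0, 0.0]))
```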