Explanations
The neuron activates on words naming protected demographic characteristics (e.g. race, gender, age, religion, ethnicity).
Negative Logits
NSError        -0.06
эксп           -0.06
qint           -0.06
眾             -0.06
Na             -0.06
subcontract    -0.06
fab            -0.06
openhagen      -0.06
Ernst          -0.06
wort           -0.06
Positive Logits
interracial    0.09
Race           0.09
racially       0.09
racial         0.09
racial         0.09
race           0.08
Religion       0.07
acial          0.07
IAL            0.07
classList      0.07
Activations Density 0.007%
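
To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of token positions on which the neuron's activation exceeds a threshold; the page does not define it. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of tokens on which the neuron fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron fires on 7 of 100,000 token positions.
sample = [1.5] * 7 + [0.0] * 99_993
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.007%
```

Under this reading, a density of 0.007% means the neuron is active on roughly 7 of every 100,000 tokens, which fits a feature tied to a narrow set of words.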