Explanations
protected characteristics and discrimination
Negative Logits
ERROR       0.39
姐妹        0.37
DWORD       0.37
房間        0.37
教程        0.36
燡          0.36
quarts      0.35
suptitle    0.35
eddies      0.34
磬          0.34
Positive Logits
religion       0.54
sexism         0.53
nationality    0.50
Religion       0.49
Religion       0.47
Gender         0.47
race           0.46
性别           0.46
Race           0.46
gender         0.45
Activations Density: 0.015%