INDEX
Explanations
topics related to bias and discrimination, especially in the context of gender and race
Head Attr Weights
0:0.13
1:0.04
2:0.14
3:0.06
4:0.05
5:0.04
6:0.18
7:0.04
8:0.04
9:0.20
10:0.03
11:0.02
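
The per-head weights above can be easier to read once ranked. The following is a minimal sketch, assuming each `head:weight` pair gives the share of attribution assigned to that attention head; the variable names (`head_weights`, `ranked`, `top3`, `total`) are illustrative and not part of the dashboard.

```python
# Head attribution weights copied from the list above (head index -> weight).
head_weights = {
    0: 0.13, 1: 0.04, 2: 0.14, 3: 0.06, 4: 0.05, 5: 0.04,
    6: 0.18, 7: 0.04, 8: 0.04, 9: 0.20, 10: 0.03, 11: 0.02,
}

# Rank heads from largest to smallest weight.
ranked = sorted(head_weights.items(), key=lambda kv: kv[1], reverse=True)
top3 = ranked[:3]

# The listed values are rounded to two decimals, so the total need not be exactly 1.
total = round(sum(head_weights.values()), 2)

for head, weight in top3:
    print(f"head {head}: {weight:.2f}")
print(f"sum of listed weights: {total:.2f}")
```

Under that assumption, heads 9, 6, and 2 carry the largest weights, and the listed values sum to 0.97, which is consistent with rounding of weights that sum to 1.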
Negative Logits
Mortal    -3.23
Turtles   -3.21
lease     -3.19
LU        -3.16
APTER     -3.15
LU        -3.15
Evan      -3.11
VC        -3.10
Pett      -3.08
aiden     -3.07
Positive Logits
bias        7.80
biases      7.33
biased      7.11
biased      6.48
prejudices  5.40
prejud      5.35
unbiased    4.92
prejudice   4.87
skew        4.68
skewed      4.49
Activations Density 0.004%