Explanations
references to the color "white."
Negative Logits
Token    Logit
acles    -0.16
nt       -0.16
nown     -0.16
enne     -0.16
rian     -0.15
istic    -0.15
ogle     -0.15
acle     -0.15
nts      -0.14
ret      -0.14
Positive Logits
Token          Logit
-collar        0.20
supremacist    0.18
ened           0.17
-white         0.17
papers         0.16
paper          0.15
isz            0.15
/black         0.15
enment         0.15
453            0.14
Activations Density 0.041%