INDEX
Explanations
references to the color white in various contexts
Negative Logits
Token        Logit
nt           -0.20
nd           -0.20
purple       -0.18
name         -0.17
nya          -0.17
nts          -0.17
ments        -0.16
rian         -0.16
epad         -0.16
mand         -0.15
Positive Logits
Token        Logit
supremacist  0.22
-white       0.22
-collar      0.21
legg         0.20
WHITE        0.20
White        0.19
chalk        0.19
caps         0.19
Noise        0.19
White        0.19
Activations Density 0.041%
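
A density figure like the one above is usually read as the fraction of sampled tokens on which the feature fires. The sketch below shows that calculation under this assumption; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (number of active tokens) / (number of tokens).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data for illustration only: one active token out of four.
sample = [0.0, 0.0, 0.0, 1.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

A sparse feature such as this one would be active on only a handful of tokens per ten thousand, so in practice the calculation is run over a large sample of text.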