INDEX
Explanations
mentions of the word "White" in various contexts
Negative Logits
nd       -0.20
nt       -0.17
purple   -0.16
rian     -0.16
epad     -0.15
ments    -0.15
istic    -0.15
nya      -0.14
roz      -0.14
scope    -0.14
Positive Logits
supremacist   0.21
-collar       0.21
-white        0.20
bread         0.20
aker          0.20
WHITE         0.19
-trash        0.19
White         0.19
paper         0.19
papers        0.19
Activations Density 0.043%