INDEX
Explanations
references to the color white
Negative Logits
nt       -0.23
ments    -0.20
nd       -0.20
nya      -0.18
ment     -0.17
rest     -0.17
ly       -0.16
name     -0.16
AMES     -0.16
ively    -0.15
Positive Logits
-collar        0.28
hall           0.26
bread          0.24
caps           0.24
supremacist    0.24
-hot           0.23
papers         0.23
-trash         0.22
legg           0.22
board          0.22
Activations Density: 0.033%