INDEX
Explanations
references to the color black and related concepts
Negative Logits
ics         -0.69
ison        -0.43
als         -0.39
hawks       -0.39
burn        -0.39
agascar     -0.39
mund        -0.38
win         -0.38
ropolitan   -0.38
ational     -0.37
Positive Logits
ges         0.17
abilities   0.17
urry        0.16
rani        0.15
minded      0.15
rub         0.15
ight        0.14
puted       0.14
leen        0.14
Flake       0.14
Activations Density: 0.054%