INDEX
Explanations
color-related words
references to colors and various social issues
Negative Logits
Legend      -0.64
1001        -0.63
Operation   -0.58
Nav         -0.56
ª           -0.55
FY          -0.54
TOP         -0.54
ãĥīãĥ©      -0.54
Reloaded    -0.53
äºĶ         -0.53
Positive Logits
alike         1.65
respectively  1.63
etc           1.00
and           0.81
interchange   0.80
together      0.77
weren         0.75
blah          0.75
versa         0.74
AND           0.70
Activations Density 0.351%