INDEX
Explanations
mentions of the color orange
references to the color orange and aluminum
Negative Logits
Token      Logit
risome     -1.13
fare       -0.93
neys       -0.86
awar       -0.85
tsky       -0.84
liness     -0.83
ringe      -0.82
friend     -0.76
rol        -0.73
nam        -0.71
Positive Logits
Token      Logit
flats       0.73
slic        0.67
cans        0.66
foil        0.66
dotted      0.63
toast       0.63
hops        0.63
skirts      0.61
peel        0.60
simultane   0.59
Activations Density: 0.075%