INDEX
Explanations
negations or denials paired with adjectives
phrases emphasizing absence or lack of something
Negative Logits
orks    -0.80
flies   -0.74
yx      -0.68
fs      -0.68
UME     -0.67
rib     -0.66
olds    -0.66
haul    -0.66
die     -0.64
Chains  -0.64
Positive Logits
else           0.79
hidden         0.77
buried         0.72
intrinsic      0.70
objectionable  0.70
overlap        0.70
happening      0.69
poetic         0.68
lurking        0.68
shameful       0.67
Activations Density: 0.044%