Explanations
comparisons or preferences between two options, typically favoring one over the other
comparisons that emphasize preference for one option over another
Negative Logits
minent    -0.75
amba      -0.75
mberg     -0.75
adium     -0.75
eria      -0.74
ruary     -0.73
ppo       -0.73
cision    -0.72
endale    -0.71
elaide    -0.70
Positive Logits
than            0.78
preferring      0.72
unimagin        0.72
pricey          0.70
trivial         0.68
Ide             0.68
inconvenient    0.66
innocuous       0.66
unpleasant      0.65
Leh             0.65
Activations Density 0.015%