INDEX
Explanations
phrases indicating a preference for one option over another
phrases that emphasize a preference for alternatives or comparisons
Negative Logits
Token     Logit
iard      -0.64
mberg     -0.63
Coffee    -0.60
andal     -0.57
endale    -0.57
esi       -0.57
adium     -0.57
DRAG      -0.56
amba      -0.55
enos      -0.55
Positive Logits
Token       Logit
than        0.96
than        0.82
iving       0.68
acent       0.66
rame        0.65
preferring  0.65
trivial     0.63
othes       0.62
frivolous   0.62
usions      0.61
Activations Density: 0.022%