INDEX
Explanations
comparisons that emphasize preference or prioritization
comparative phrases emphasizing preference or alternatives
Negative Logits
ppo       -0.75
amba      -0.73
ruary     -0.72
mberg     -0.71
eria      -0.71
ocaust    -0.70
erto      -0.70
uay       -0.69
draft     -0.69
adium     -0.68
Positive Logits
unimagin      0.78
than          0.69
innocuous     0.69
Ide           0.69
distinguish   0.68
rather        0.67
irrelevant    0.67
metic         0.67
amusing       0.66
preferring    0.65
Activations Density 0.016%