INDEX
Explanations
questions or comparisons asking to choose between different options
questions that ask for preferences or choices across various contexts
Negative Logits
Token       Logit
krit        -0.80
zar         -0.77
ruce        -0.76
fty         -0.75
comings     -0.73
anyahu      -0.70
Corona      -0.69
vity        -0.67
staking     -0.67
figure      -0.67
Positive Logits
Token        Logit
?",          0.79
ident        0.76
appropri     0.71
desired      0.70
domin        0.69
opter        0.68
accordingly  0.68
?'"          0.68
?),          0.67
deserving    0.66
Activations Density: 0.284%