INDEX
Explanations
terms indicating a preference for one option over another
statements expressing preference or choice
Negative Logits
breaks       -0.79
runner       -0.74
Runner       -0.73
esi          -0.67
orig         -0.65
eval         -0.65
INAL         -0.64
uss          -0.62
runner       -0.61
ults         -0.61
Positive Logits
than         0.81
tolerate     0.74
":["         0.71
prioritize   0.71
Than         0.69
accommodate  0.68
otomy        0.66
afford       0.65
cater        0.64
Intelligent  0.64
Activations Density: 0.015%