Explanations
phrases indicating a preference for a specific option or action
expressions of preference or choice
Negative Logits
rising    -0.74
emin      -0.70
uther     -0.70
afore     -0.64
apt       -0.62
calling   -0.62
anni      -0.61
Loading   -0.60
ahead     -0.60
Behind    -0.60
Positive Logits
lose      0.96
than      0.96
spend     0.95
avoid     0.86
starve    0.85
settle    0.81
gamble    0.80
tolerate  0.79
wait      0.79
accept    0.78
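As a rough illustration of how the logit lists above can be read, the sketch below ranks a few of the listed tokens by value. The token/value pairs are copied from the lists; the dictionary layout and the `top_tokens` helper are hypothetical and are not part of any dashboard API.

```python
# Values copied from the logit lists above; the dict layout is an assumption
# made for illustration only.
logit_effects = {
    "rising": -0.74, "emin": -0.70, "uther": -0.70,
    "lose": 0.96, "than": 0.96, "spend": 0.95, "avoid": 0.86,
}


def top_tokens(effects, k, positive=True):
    """Return the k (token, value) pairs with the largest values,
    or the most negative values when positive=False."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]


# Tokens the feature pushes up the most, and the ones it pushes down the most.
most_promoted = top_tokens(logit_effects, 3, positive=True)
most_suppressed = top_tokens(logit_effects, 3, positive=False)
```

The promoted tokens (lose, spend, avoid, ...) are verbs that commonly follow "would rather" or "prefer to", which fits the explanations listed at the top.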
Activations Density 0.042%