INDEX
Explanations
phrases indicating a choice or decision point
phrases indicating choice or opinion, specifically contrasting options
Negative Logits
ourses      -0.66
inational   -0.66
xus         -0.66
vier        -0.64
marked      -0.62
urate       -0.61
umper       -0.61
runner      -0.60
ipal        -0.60
uph         -0.59
Positive Logits
lando       0.88
acle        0.81
hate        0.75
acles       0.74
Else        0.73
Bust        0.73
not         0.72
hate        0.70
starve      0.70
lose        0.69
Activations Density: 0.052%