INDEX
Explanations
phrases indicating lack of surprise
phrases indicating surprise or unexpectedness
Negative Logits
Token      Logit
tein       -0.67
ouf        -0.65
Respond    -0.64
abases     -0.63
Role       -0.63
chens      -0.62
eday       -0.62
abouts     -0.62
href       -0.60
href       -0.60
Positive Logits
Token            Logit
surprise         1.46
shock            1.26
surpr            1.06
shock            1.02
Surprise         0.97
disappointment   0.96
relief           0.93
revelation       0.92
surprises        0.91
surprising       0.88
Activations Density: 0.072%
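
As a rough illustration of what a density figure like the one above could mean, the sketch below computes the fraction of tokens on which a feature fires. It assumes density is defined as the share of token positions with activation above a threshold of zero; the dashboard's exact definition is not stated here, and the function name `activation_density` and the sample data are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature is
    active. The threshold of 0.0 is an illustrative default, not a documented one.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 7 of which activate the feature.
sample = [0.0] * 9993 + [0.4, 1.2, 0.8, 2.1, 0.3, 0.9, 1.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.070%
```

A density this low would mean the feature is sparse: it is silent on nearly all tokens and fires only on a small, specific set of contexts.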