Explanations
phrases related to surprises
phrasing related to expectations or surprises
Negative Logits
audi          -0.66
abulary       -0.62
erential      -0.61
ortex         -0.59
filibuster    -0.57
oyer          -0.57
orce          -0.56
Suc           -0.56
Logged        -0.56
adelphia      -0.55
Positive Logits
pires         0.81
follows       0.74
pired         0.73
regards       0.68
unwelcome     0.65
well          0.64
lies          0.63
soon          0.61
piration      0.61
opposed       0.60
Activations Density 0.043%