INDEX
Explanations
expressions of surprise or disbelief
emotional reactions or expressions of surprise and realization
Negative Logits
unal         -0.83
ullivan      -0.82
ciplinary    -0.74
occasion     -0.66
Flavoring    -0.66
aband        -0.65
Cosponsors   -0.63
cephal       -0.60
minist       -0.60
ioned        -0.60
Positive Logits
?".          0.96
'"           0.91
Hey          0.90
'."          0.85
?'"          0.83
.'"          0.83
hey          0.83
hey          0.81
.")          0.80
…."          0.79
Activations Density 0.160%