INDEX
Explanations
expressions of surprise or realization
Negative Logits
apo        -0.18
ãĥ¼ãĥĭ     -0.17
apor       -0.17
icator     -0.16
fik        -0.15
ero        -0.15
ators      -0.15
encer      -0.15
eways      -0.15
_OC        -0.14
Positive Logits
annes      0.17
bother     0.16
snap       0.16
Snap       0.16
yes        0.15
irsch      0.15
rens       0.15
sm         0.15
yeah       0.14
Äįan       0.14
Activations Density 0.015%