Explanations
expressions of realization or surprise
Negative Logits
oard      -0.18
ienie     -0.17
evice     -0.16
eros      -0.16
ical      -0.15
incare    -0.15
ural      -0.15
indr      -0.14
een       -0.14
inent     -0.14
Positive Logits
318       0.16
reel      0.15
es        0.15
/welcome  0.15
rem       0.15
Ree       0.15
Movement  0.14
boy       0.14
rem       0.14
itters    0.14
Activations Density: 0.040%