INDEX
Explanations
apologies or expressions of regret
Negative Logits
alo            -0.15
imary          -0.15
possibilities  -0.14
ini            -0.14
šet            -0.14
imen           -0.14
irk            -0.14
alli           -0.14
QUIT           -0.14
jeopardy       -0.13
Positive Logits
couldn         0.18
/not           0.17
kus            0.17
bout           0.16
813            0.16
meant          0.16
couldn         0.15
bout           0.15
Mods           0.15
ably           0.15
Activations Density 0.030%