INDEX
Explanations
phrases about potential outcomes or capabilities
Negative Logits
somehow     -0.19
untime      -0.15
žÃŃ         -0.15
aron        -0.15
Probably    -0.15
à¸Īะ        -0.14
egl         -0.14
irs         -0.14
pery        -0.14
.overlay    -0.14
Positive Logits
sometimes   0.34
sometimes   0.28
Sometimes   0.26
be          0.24
Sometimes   0.24
ometimes    0.24
often       0.23
range       0.22
often       0.21
oft         0.20
Activations Density: 0.156%