INDEX
Explanations
phrases that indicate direction or destination
Negative Logits
up        -0.20
rup       -0.17
wright    -0.17
rift      -0.15
rol       -0.15
au        -0.15
wick      -0.15
nt        -0.14
nap       -0.14
exactly   -0.14
Positive Logits
gether    0.22
obus      0.20
asting    0.20
tes       0.20
chter     0.20
/from     0.20
OLS       0.20
ools      0.20
pline     0.20
wner      0.19
Activations Density: 0.142%