INDEX
Explanations
forward and backward direction
Negative Logits
ABCDEFGHIJKLMNOP  -0.11
evin              -0.11
-âĢIJ             -0.10
aille             -0.10
enti              -0.09
ec                -0.09
izu               -0.09
oke               -0.09
utter             -0.09
kle               -0.09
Positive Logits
-thinking         0.21
-looking          0.20
backward          0.17
ly                0.17
/back             0.17
-facing           0.15
-forward          0.15
ness              0.15
slash             0.14
-back             0.13
Activations Density 0.027%
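
The page does not define the density figure. Assuming it is the fraction of sampled tokens on which the feature has a nonzero activation, it could be computed roughly as in the sketch below. The function name, the threshold parameter, and the toy data are all hypothetical and only illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all; the dashboard does not state this.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy data (not from the dashboard): 10,000 tokens, 3 of which fire.
toy = [0.0] * 9997 + [1.2, 0.4, 2.5]
print(f"{activation_density(toy):.3%}")  # prints 0.030%
```

Under this reading, a density of 0.027% would mean the feature fires on roughly 27 of every 100,000 tokens.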