INDEX
    Explanations

    forward and backward direction

    Negative Logits
    Token               Logit
    ABCDEFGHIJKLMNOP    -0.11
    evin                -0.11
    -‐                  -0.10
    aille               -0.10
    enti                -0.09
    ec                  -0.09
    izu                 -0.09
    oke                 -0.09
    utter               -0.09
    kle                 -0.09
    Positive Logits
    Token        Logit
    -thinking    0.21
    -looking     0.20
    backward     0.17
    ly           0.17
    /back        0.17
    -facing      0.15
    -forward     0.15
    ness         0.15
    slash        0.14
    -back        0.13
    Activation Density: 0.027%

    No Known Activations