    Explanations

    phrases that indicate purpose or intent

    Negative Logits
                    -1.16
     ſind           -1.13
     betweenstory   -1.09
    sizeCache       -1.03
    <unused43>      -1.03
    <unused23>      -1.03
    <unused28>      -1.02
    <unused41>      -1.02
    <unused14>      -1.02
    [@BOS@]         -1.02
    Positive Logits
     for    1.20
     with   0.82
     by     0.81
     from   0.77
     at     0.75
     to     0.74
     as     0.74
     on     0.73
     has    0.64
     is     0.63
    Act Density 0.952%

    No Known Activations