INDEX
    Explanations

    occurrences of the word "the"

    Negative Logits
    [@BOS@]       -0.99
    <unused43>    -0.99
    <unused41>    -0.99
    <unused74>    -0.98
    <unused14>    -0.98
    <unused8>     -0.98
    <unused17>    -0.98
    <unused23>    -0.98
    <unused16>    -0.98
    <unused3>     -0.98
    Positive Logits
     we     0.42
     I      0.38
     he     0.36
    -       0.32
     she    0.28
     it     0.28
     there  0.27
     no     0.27
    g       0.26
    p       0.26
    Act Density 0.135%

    No Known Activations