INDEX
    Explanations

    references to the term "the" and its variations in phrases

    Negative Logits
    Token                   Logit
    "<unused42>"            -1.27
    "<unused41>"            -1.27
    "<pad>"                 -1.26
    "[@BOS@]"               -1.26
    "<unused68>"            -1.26
    "<unused74>"            -1.26
    "<unused43>"            -1.26
    "<unused23>"            -1.26
    "<unused3>"             -1.26
    "<unused14>"            -1.26
    Positive Logits
    Token                   Logit
    ","                     0.42
    (token not rendered)    0.40
    "1"                     0.38
    " for"                  0.36
    " I"                    0.35
    " and"                  0.34
    "I"                     0.33
    "↵↵"                    0.32
    ":"                     0.30
    "The"                   0.30
    Activation Density: 0.300%

    No Known Activations