    Explanations

    occurrences of the word "The" and its variations, indicating a focus on specific articles

    "The" followed by titles

    Negative Logits
    Token                Logit
    "aarrggbb"           -1.09
    "SequentialGroup"    -0.96
    "featureID"          -0.95
    "<unused14>"         -0.91
    "<unused8>"          -0.91
    "[@BOS@]"            -0.91
    "<unused41>"         -0.91
    "<unused43>"         -0.91
    "<unused28>"         -0.91
    "<unused3>"          -0.91
    Positive Logits
    Token                Logit
    "The"                1.30
    " The"               1.21
    "THE"                1.02
    " THE"               0.98
    "the"                0.89
    "ethe"               0.66
    " ザ"                0.60
    (blank)              0.52
    "sthe"               0.50
    (blank)              0.49
    Activation Density: 0.168%

    No Known Activations