    Explanations

    occurrences of expressions indicating the purpose or intent of a research paper or study

    Negative Logits
    Token           Logit
    "<unused17>"    -0.74
    "<unused3>"     -0.74
    "<unused28>"    -0.74
    "<unused51>"    -0.74
    "<unused74>"    -0.74
    "[@BOS@]"       -0.74
    "<unused8>"     -0.74
    "<unused43>"    -0.74
    "<unused79>"    -0.74
    "<unused41>"    -0.74
    Positive Logits
    Token           Logit
    "RunWith"       0.30
    " overview"     0.28
    " guide"        0.28
    " purpose"      0.28
    " not"          0.28
    " scope"        0.27
    " vē"           0.27
    "scope"         0.26
    " is"           0.26
    " paper"        0.26
    Activation Density: 0.043%

    No Known Activations