INDEX
    Explanations

    articles and their variations, indicating a focus on nouns or noun phrases

    Negative Logits
    pth       -0.18
    777       -0.17
    321       -0.16
    821       -0.15
    804       -0.15
    906       -0.15
    dden      -0.15
     Knot     -0.14
    venir     -0.14
    attles    -0.14
    Positive Logits
    isse      0.16
    anda      0.15
    ument     0.15
    pra       0.15
    UDO       0.15
    ences     0.15
    olle      0.15
    interop   0.14
    овари     0.14
    .lp       0.14
    Activation Density: 0.068%

    No Known Activations