    Explanations

    Predicts the word that follows "positive".

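    Negative Logits
    Token               Value
    (not recovered)     1.14
    (not recovered)     1.00
    (not recovered)     0.99
    "In"                0.98
    (not recovered)     0.98
    (not recovered)     0.98
    (not recovered)     0.97
    "ING"               0.96
    "Además"            0.96
    "ל"                 0.95

    Positive Logits
    Token               Value
    "ма"                1.15
    (not recovered)     1.15
    "<0x80>"            1.06
    "тна"               1.05
    "i"                 1.02
    "il"                1.02
    (not recovered)     1.01
    " positive"         0.97
    "is"                0.94
    "ang"               0.94
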
    Activation Density: 0.035%

    No Known Activations