INDEX
    Explanations

    negations or expressions of disagreement

    Negative Logits
    Token          Logit
    " not"         -0.18
    " não"         -0.16
    " never"       -0.15
    " nicht"       -0.14
    " не"          -0.14
    " no"          -0.14
    " niet"        -0.13
    "不得"         -0.13
    "ummings"      -0.13
    "uars"         -0.13
    Positive Logits
    Token            Logit
    "ched"           0.27
    " necessarily"   0.26
    "ori"            0.25
    "tingham"        0.25
    " anymore"       0.24
    " yet"           0.23
    "ching"          0.22
    "ches"           0.22
    "epad"           0.22
    "oriously"       0.22
    Act Density 0.265%
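
    The density figure can be read as the share of sampled tokens on which the feature fires at all. That reading is an assumption, since the dashboard does not define the term. A minimal sketch under that assumption follows; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the dashboard.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active; the source does not define it.
    """
    values = list(activations)
    if not values:
        raise ValueError("need at least one activation value")
    # Count positions where the feature is strictly above the threshold.
    active = sum(1 for v in values if v > threshold)
    return active / len(values)


# Made-up sample for illustration: 265 active tokens out of 100,000.
sample = [1.0] * 265 + [0.0] * 99_735
density = activation_density(sample)
print(f"{density:.3%}")
```

    On this reading, a density of 0.265% corresponds to roughly 265 active tokens per 100,000 sampled, which marks the feature as sparse.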

    No Known Activations