INDEX
    Explanations

    negations or expressions of disagreement

    Negative Logits
    Token        Logit
    " nor"       -0.17
    "辺"         -0.15
    "nor"        -0.14
    " nackte"    -0.14
    "346"        -0.14
    "itti"       -0.14
    "inee"       -0.14
    " nack"      -0.13
    "(utf"       -0.13
    "atern"      -0.13
    POSITIVE LOGITS
    Token        Logit
    " sure"      0.30
    "sure"       0.24
    " Sure"      0.23
    "Sure"       0.21
    " gonna"     0.20
    " exactly"   0.19
    "gon"        0.17
    "icias"      0.17
    "tingham"    0.16
    "SURE"       0.16
    Activation Density: 0.052%

    No Known Activations