INDEX
    Explanations

    negations or expressions of disagreement

    Negative Logits
     SEDS
    -0.72
     Roskov
    -0.64
     词
    -0.63
     الرياضيه
    -0.62
    Hentet
    -0.61
     Parac
    -0.58
    IGENCE
    -0.57
    ']}
    -0.56
    ."],
    -0.56
    ()].
    -0.56
    Positive Logits
     no
    0.63
    Nope
    0.63
     nope
    0.61
     No
    0.61
     NO
    0.61
     Nope
    0.59
    🙅
    0.59
    nope
    0.59
     فريبيس
    0.58
    No
    0.57
    Activation Density: 0.061%

    No Known Activations