INDEX
    Explanations

    questions or statements posing a contrary position

    phrases that express negation or doubt

    Negative Logits
        furt          -0.77
        ãĤ£           -0.71
        Ö¼            -0.68
        Fra           -0.68
        urst          -0.66
        het           -0.63
        ELD           -0.62
        eteenth       -0.61
        vironment     -0.60
        facts         -0.60

    Positive Logits
        hin            0.85
        ?".            0.85
         sooner        0.83
        ?!"            0.79
        ?",            0.79
        !?"            0.76
        ?]             0.76
        ?"             0.75
        ?).            0.75
        ?),            0.74
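    The dashboard does not say how the logit lists above were produced. A common approach is to project the feature's decoder direction through the model's unembedding matrix and keep the k most negative and k most positive tokens. The sketch below assumes that approach. The functions `logit_effects` and `top_and_bottom` and the toy 3-dimensional vectors are hypothetical and made up for illustration; they are not taken from this feature or model.

```python
# Hypothetical sketch: rank tokens by how strongly a feature's decoder
# direction pushes their logits. ASSUMPTION: a token's score is the dot
# product of the decoder vector with that token's unembedding column.

def logit_effects(decoder, unembed):
    """Return {token: score}, where score = dot(decoder, unembed[token])."""
    return {
        tok: sum(d * u for d, u in zip(decoder, col))
        for tok, col in unembed.items()
    }

def top_and_bottom(scores, k):
    """Return (k most negative, k most positive) (token, score) pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    negative = ranked[:k]                    # most suppressed tokens first
    positive = list(reversed(ranked[-k:]))   # most promoted tokens first
    return negative, positive

# Toy vectors, invented for illustration only (not real model weights).
decoder = [0.5, -0.2, 0.1]
unembed = {
    '?"': [1.0, 0.0, 0.5],
    " sooner": [0.8, -0.5, 0.0],
    "facts": [-1.0, 0.5, 0.0],
    "furt": [-0.6, 1.0, -0.4],
}

scores = logit_effects(decoder, unembed)
negative, positive = top_and_bottom(scores, 2)
print([t for t, _ in negative])  # ['facts', 'furt']
print([t for t, _ in positive])  # ['?"', ' sooner']
```

    Note that leading spaces are part of a token (as in " sooner"). This is why the same word can appear as distinct entries in such lists.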
    Act Density 0.098%
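    "Act Density" is usually read as the percentage of sampled token positions on which the feature fires. The sketch below assumes that definition, with a hypothetical `threshold` of 0 counting as "fires". The dashboard does not state which threshold or dataset it used.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    ASSUMPTION: "fires" means activation > threshold (default 0.0).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# One active position out of 1,000 tokens gives a density of 0.1%.
print(activation_density([0.0] * 999 + [2.5]))  # 0.1
```

    Under this reading, a density of 0.098% means the feature is active on roughly 1 in every 1,020 tokens.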

    No Known Activations