INDEX
    Explanations

    phrases indicating causation or reasoning, particularly with the word "because."

    Negative Logits
    lez     -0.72
    wn      -0.71
    robe    -0.69
    agin    -0.69
    ax      -0.69
    mint    -0.66
    lem     -0.65
    nin     -0.64
    Gas     -0.64
    yan     -0.63
    Positive Logits
     they       1.04
     nobody     0.90
     there      0.89
     it         0.84
     unlike     0.82
     otherwise  0.81
     we         0.79
     THEY       0.79
    */(         0.78
     he         0.75
    Activation Density 0.561%

    No Known Activations