    Explanations

    phrases indicating reasoning or causation

    the word "because" in various contexts

    Negative Logits
    Token           Logit
    "wn"            -0.79
    "shaw"          -0.76
    "ardon"         -0.74
    "agin"          -0.73
    "ns"            -0.72
    "yan"           -0.68
    "ery"           -0.67
    "jet"           -0.67
    "yr"            -0.66
    "mint"          -0.66
    Positive Logits
    Token           Logit
    "*/("           0.78
    " they"         0.73
    " anecd"        0.64
    " proxies"      0.64
    "ecause"        0.63
    " there"        0.63
    " nobody"       0.62
    " frankly"      0.61
    " we"           0.60
    " mathematic"   0.60
    Activation Density: 0.068%

    No Known Activations