INDEX
    Explanations

    phrases indicating justification or reasoning

    Negative Logits
    Token         Logit
    "wn"          -0.84
    "ns"          -0.78
    "robe"        -0.78
    "ety"         -0.74
    "shaw"        -0.72
    "mint"        -0.70
    "emi"         -0.70
    "jet"         -0.69
    "spect"       -0.68
    "ses"         -0.68

    Positive Logits
    Token         Logit
    " unlike"     0.79
    " they"       0.78
    " nobody"     0.76
    " fuck"       0.74
    " obviously"  0.74
    " there"      0.74
    " otherwise"  0.73
    " hey"        0.73
    " evidenced"  0.72
    " frankly"    0.70
    Activation Density: 0.060%

    No Known Activations