INDEX
    Explanations

    expressions of gratitude or thanks

    Negative Logits
    "Default"       -0.67
    "APD"           -0.64
    "estyles"       -0.60
    "Pred"          -0.58
    "farious"       -0.57
    " outper"       -0.57
    " behavi"       -0.56
    " dominates"    -0.56
    " pred"         -0.56
    "alth"          -0.55
    Positive Logits
    " welcome"       0.79
    "!!!!!"          0.79
    " gracious"      0.76
    " sir"           0.76
    " thank"         0.72
    " kindly"        0.72
    " goodbye"       0.72
    " Thank"         0.71
    " generous"      0.70
    " hello"         0.67
    Act Density 0.021%

    No Known Activations