INDEX
    Explanations

    expressions of empathy or sympathy

    expressions of apology or regret

    Negative Logits
    "minecraft"     -0.79
    "sports"        -0.77
    "ouver"         -0.73
    "tein"          -0.71
    "rouse"         -0.70
    "hack"          -0.68
    "tnc"           -0.68
    "authorized"    -0.67
    "craft"         -0.66
    "impro"         -0.64
    Positive Logits
    " sorry"        1.37
    " Sorry"        1.00
    "Sorry"         0.94
    "sorry"         0.94
    " excuse"       0.88
    " pardon"       0.84
    " apologies"    0.78
    "soever"        0.77
    " THANK"        0.76
    "ा"             0.76
    Act Density 0.007%

    No Known Activations