    Explanations

    expressions of happiness or satisfaction

    expressions of gratitude or happiness

    Negative Logits
    " helicop"    -0.76
    "effic"       -0.73
    "artifacts"   -0.72
    "improve"     -0.69
    "版"          -0.67
    " contam"     -0.67
    " impair"     -0.66
    "Improve"     -0.65
    "cend"        -0.65
    "irrel"       -0.65
    Positive Logits
    " glad"       0.86
    " Tid"        0.79
    "ness"        0.77
    "dy"          0.76
    "terday"      0.72
    "ा"           0.71
    "joy"         0.71
    " Sonia"      0.70
    " tid"        0.69
    "imar"        0.68
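    Logit lists like these are commonly obtained by projecting the feature's decoder direction onto the model's unembedding matrix and keeping the most negative and most positive entries. The following is a minimal pure-Python sketch, assuming the effect on token t is simply the dot product of the decoder direction with column t of the unembedding; the function names and toy values are hypothetical, and the dashboard's actual pipeline may differ (for example in normalization).

```python
from typing import List, Sequence, Tuple

def logit_effects(decoder_dir: Sequence[float],
                  unembed: Sequence[Sequence[float]]) -> List[float]:
    """Dot the feature's decoder direction (length d_model) with every
    column of the unembedding matrix (shape d_model x vocab_size).
    Assumption: no normalization is applied before the projection."""
    if len(unembed) != len(decoder_dir):
        raise ValueError("unembed must have one row per model dimension")
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]

def top_logits(decoder_dir: Sequence[float],
               unembed: Sequence[Sequence[float]],
               vocab: Sequence[str],
               k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (positive, negative): the k tokens whose logits the feature
    raises most (largest first) and lowers most (most negative first)."""
    effects = logit_effects(decoder_dir, unembed)
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return positive, negative

# Toy example: d_model = 2, vocabulary of four tokens (values are made up).
positive, negative = top_logits(
    [1.0, 0.5],
    [[0.5, -0.5, 1.0, 0.0],
     [0.5, 0.25, 0.0, -1.0]],
    [" glad", " impair", " joy", " contam"],
    k=2,
)
print(positive)  # [(' joy', 1.0), (' glad', 0.75)]
print(negative)  # [(' contam', -0.5), (' impair', -0.375)]
```

    Sorting once and slicing both ends keeps the sketch short; for a real vocabulary of tens of thousands of tokens a partial selection (for example `heapq.nlargest`) would avoid the full sort.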
    Activation density: 0.018%
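    Activation density is the share of sampled token positions on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly greater than zero (the threshold is a hypothetical parameter, and the dashboard's sampling procedure is not stated here): a density of 0.018% corresponds to roughly 18 active positions per 100,000 tokens.

```python
from typing import Iterable

def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose feature activation exceeds
    `threshold` (assumed to be 0.0, i.e. any positive activation counts)."""
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total

# 18 firing positions out of 100,000 sampled tokens.
sample = [0.0] * 99_982 + [2.5] * 18
print(activation_density(sample))
```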

    No Known Activations