INDEX
    Negative Logits
     thank       -2.17
    thank        -1.95
     Thank       -1.87
    Thank        -1.84
     thanked     -1.79
     THANK       -1.76
     thanking    -1.72
     thankful    -1.62
    THANK        -1.61
     Спасибо     -1.53
    Positive Logits
     to          0.64
    ful          0.64
     for         0.59
     you         0.55
    fully        0.53
    ingly        0.53
     "           0.48
     God         0.48
    fulness      0.48
                 0.46
    Activation Density 0.107%

    No Known Activations