INDEX
    Explanations

    expressions of preference or affection

    Negative Logits
    Token             Logit
    "TestingModule"   -0.74
    " Asher"          -0.73
    " Hernandez"      -0.72
    " genoux"         -0.71
    " Carrasco"       -0.70
    " Berman"         -0.70
    " Pennington"     -0.68
    " brazos"         -0.67
    "шель"            -0.67
    "se"              -0.67
    Positive Logits
    Token        Logit
    "dislike"    0.88
    " liked"     0.85
    "Likes"      0.83
    " Likes"     0.82
    " Liked"     0.80
    " Lik"       0.79
    "liked"      0.73
    " gusta"     0.72
    "👍👍"       0.72
    " dislike"   0.71
    Activation Density: 0.047%

    No Known Activations