INDEX
    Explanations

    references to appearance or visual assessment

    Negative Logits
    "utra"         -0.20
    "874"          -0.16
    " somew"       -0.15
    "нки"          -0.14
    " somewhere"   -0.14
    "iar"          -0.14
    "chor"         -0.14
    "ugar"         -0.14
    " Carp"        -0.13
    "ovation"      -0.13
    Positive Logits
    " like"        0.51
    "like"         0.44
    " Like"        0.43
    "Like"         0.42
    " LIKE"        0.38
    "LIKE"         0.37
    "-like"        0.36
    " likes"       0.35
    "_like"        0.33
    ".like"        0.31
    Activation Density: 0.013%

    No Known Activations