    Explanations

    phrases expressing negation and clarification

    Negative Logits
     zwar
    -0.16
    odied
    -0.15
    اع
    -0.14
    欠
    -0.14
    един
    -0.14
    emory
    -0.14
    airo
    -0.14
    vice
    -0.14
    olkien
    -0.14
    elin
    -0.13
    Positive Logits
    还是
    0.17
     nonetheless
    0.15
    tera
    0.15
    GLOSS
    0.15
    iew
    0.15
    tering
    0.15
    ully
    0.14
    tiles
    0.14
    .geo
    0.14
     elsewhere
    0.14
    Act Density 0.158%

    No Known Activations