    Explanations

    phrases related to classification and categorization

    Negative Logits
    undra       -0.18
    sth         -0.17
    acock       -0.17
    rieb        -0.15
    verbatim    -0.14
    inston      -0.14
    cela        -0.14
    ohen        -0.14
    kü          -0.14
    asename     -0.13

    Positive Logits
    ness         0.23
    ifi          0.19
    ifying       0.16
    -looking     0.15
    utter        0.14
    izz          0.14
    ly           0.14
    ewing        0.14
    NESS         0.13
    ifiable      0.13

    Activation Density: 0.007%

    No Known Activations