INDEX
    Explanations

    words related to representation and social or ethical issues

    Negative Logits
    Token    Logit
    o        -0.24
    at       -0.23
    ains     -0.20
    ay       -0.19
    im       -0.19
    ings     -0.18
    i        -0.18
    an       -0.17
    ives     -0.16
    ap       -0.16
    Positive Logits
    Token    Logit
    uze      0.16
    ếu       0.16
    eren     0.16
    族自治   0.15
    )((((    0.15
    amage    0.14
    ĩ        0.14
    Ế        0.14
    kowski   0.14
    imit     0.14
    Activation Density: 0.056%

    No Known Activations