INDEX
    Explanations

    phrases related to social norms and inequality

    NEGATIVE LOGITS
    raq
    -0.16
    olle
    -0.16
    jang
    -0.15
    LEASE
    -0.15
     prostitutas
    -0.15
    ωμα
    -0.15
    ytut
    -0.14
    rů
    -0.14
    бот
    -0.14
    emean
    -0.14
    POSITIVE LOGITS
    atur
    0.17
    ips
    0.16
    ipro
    0.15
    acht
    0.15
    conti
    0.14
    hoe
    0.14
    ky
    0.14
     gre
    0.14
    bard
    0.14
    닉
    0.13
    Act Density 0.387%

    No Known Activations