INDEX
    Explanations

    references to stereotypes and biases

    Negative Logits
    "ico"      -0.18
    "elan"     -0.17
    "iero"     -0.17
    "sik"      -0.17
    "rai"      -0.17
    "ayo"      -0.15
    " Um"      -0.15
    "eco"      -0.15
    "idine"    -0.15
    "nels"     -0.15
    Positive Logits
    "embr"          0.16
    "598"           0.16
    (whitespace)    0.14
    "rez"           0.14
    " snap"         0.14
    "zos"           0.14
    ".snap"         0.13
    "acus"          0.13
    "aws"           0.13
    " con"          0.13
    Activation Density: 0.145%

    No Known Activations