    Explanations

    references to social roles and group identities

    Negative Logits
        Token       Logit
        " Halk"     -0.15
        "olik"      -0.15
        "round"     -0.15
        "_UNUSED"   -0.14
        "UTH"       -0.14
        "heels"     -0.14
        "VICES"     -0.13
        "xico"      -0.13
        " both"     -0.13
        "Wide"      -0.13
    Positive Logits
        Token       Logit
        "lint"      0.17
        "ynos"      0.16
        "azen"      0.16
        "engin"     0.15
        ".cod"      0.14
        "åıį"       0.14
        "bÃŃ"       0.14
        "ños"       0.14
        "ارد"       0.14
        "_FONT"     0.13
    Activation Density: 0.079%

    No Known Activations