INDEX
    Explanations

    specific words or phrases associated with identities or cultural markers, particularly those related to ethnicity or heritage

    Negative Logits
    "なる"        -0.17
    "abant"       -0.15
    "rane"        -0.15
    " reb"        -0.14
    "quito"       -0.14
    " Mug"        -0.14
    "incr"        -0.14
    " Murdoch"    -0.13
    "brook"       -0.13
    "ستان"        -0.13
    Positive Logits
    "elow"        0.15
    "hood"        0.15
    "ूद"          0.15
    "etical"      0.15
    " 당"         0.15
    "riba"        0.14
    "amb"         0.14
    " kadar"      0.14
    "fully"       0.14
    "eldon"       0.14
    Act Density 0.047%

    No Known Activations