INDEX
    Explanations

    protected characteristics

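    NEGATIVE LOGITS
    "DEV"  0.40
    " env"  0.39
    "跑步" (Chinese: "running")  0.38
    [token not shown]  0.38
    " lettera" (Italian: "letter")  0.37
    " thermodynam"  0.37
    "ivided"  0.36
    " types"  0.36
    " letra" (Spanish/Portuguese: "letter")  0.36
    " volatiles"  0.36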
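    POSITIVE LOGITS
    " 장애" (Korean: "disability")  0.52
    " gender"  0.51
    " disability"  0.48
    " الجنس" (Arabic: "sex/gender")  0.47
    " religion"  0.46
    "gender"  0.46
    " creed"  0.44
    " Disabilities"  0.44
    " Gender"  0.43
    " orientación" (Spanish: "orientation")  0.42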
    Activation Density 0.003%

    No Known Activations