    Explanations

    terms related to human rights and humanity

    Negative Logits
    Token       Logit
    gf          -0.15
    yr          -0.15
    lying       -0.15
    oned        -0.15
    gi          -0.15
    INCT        -0.15
    hausen      -0.15
    yang        -0.15
    ional       -0.15
    gers        -0.14
    Positive Logits
    Token       Logit
    -readable   0.18
    ized        0.18
    izing       0.18
    IGHLIGHT    0.17
    ENCHMARK    0.16
    male        0.16
    pire        0.15
    istic       0.15
    itarian     0.15
    ismatch     0.15
    Activation Density 0.037%
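
A density figure like this is usually read as the share of sampled tokens on which the feature fires. The sketch below illustrates that reading only: it assumes "fires" means an activation strictly above zero, and the function name, threshold parameter, and sample data are all hypothetical, not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Synthetic data for illustration only: 37 active tokens out of 100,000.
sample = [0.0] * 99_963 + [1.0] * 37
density = activation_density(sample)
print(f"{density:.3f}%")
```

At a density this low, the feature is silent on almost every token, so a small sample can easily contain no firing tokens at all.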

    No Known Activations