INDEX
    Explanations

    words related to power dynamics and societal issues, such as disenfranchisement, recalcitrance, and oppression

    Negative Logits
    Token       Logit
    " effe"     -1.83
    " desir"    -1.64
    " lidl"     -1.63
    " dispen"   -1.62
    " erec"     -1.62
    " ivi"      -1.58
    " igno"     -1.56
    " wien"     -1.56
    " noss"     -1.55
    " noel"     -1.55
    Positive Logits
    Token       Logit
    ","         0.75
    "ment"      0.74
    ";"         0.74
    "."         0.71
    "ative"     0.71
    " and"      0.71
    "ments"     0.71
    "ation"     0.70
    "ous"       0.68
    "ly"        0.68
    Activation Density: 0.644%

    No Known Activations