INDEX
    Explanations

    politically related terms, particularly focusing on party affiliations

    references to political parties and gender

    Negative Logits
    "mun"          -0.76
    " Stard"       -0.67
    "adobe"        -0.63
    "ANN"          -0.61
    " RELE"        -0.61
    " WARN"        -0.60
    " Producer"    -0.58
    "abc"          -0.57
    "atana"        -0.57
    "andestine"    -0.57
    Positive Logits
    " counterpart"     0.88
    " counterparts"    0.85
    " equivalents"     0.76
    "itto"             0.68
    " versions"        0.66
    " versa"           0.65
    " flakes"          0.65
    "д"                0.65
    " captivity"       0.64
    "ngth"             0.64
    Activation Density: 0.325%

    No Known Activations