INDEX
    Explanations

    phrases related to politics and power dynamics

    Negative Logits
     unwarran
    -1.72
     reluct
    -1.60
     inev
    -1.60
     disagre
    -1.58
     volunte
    -1.51
     increa
    -1.51
     affor
    -1.47
     desir
    -1.46
     uninten
    -1.45
     excru
    -1.44
    Positive Logits
    .
    0.91
    .”
    0.72
    ↵↵
    0.72
    .~
    0.71
    ."
    0.70
    0.70
    ).
    0.69
     .
    0.69
    ↵↵↵
    0.69
    !
    0.68
    Act Density 0.765%

    No Known Activations