INDEX
    Explanations

    mentions of authority or power dynamics

    instances of the word "has" and its variations in context

    Negative Logits
    Token         Logit
    " Pair"       -0.68
    " filming"    -0.63
    " dot"        -0.63
    " recalls"    -0.62
    "etter"       -0.62
    "TG"          -0.61
    "etting"      -0.59
    "burse"       -0.59
    "hooting"     -0.58
    "umping"      -0.57
    Positive Logits
    Token         Logit
    " been"       1.27
    " gotten"     1.05
    "been"        1.03
    " behaved"    1.02
    " gone"       1.01
    " become"     0.99
    " begun"      0.96
    "oken"        0.94
    " ceased"     0.93
    " fallen"     0.93
    Activation Density: 0.378%

    No Known Activations