INDEX
    Explanations

    references to a specific pronoun for individuals, particularly focusing on their actions and statements

    Negative Logits
    "noon"          -0.75
    "acters"        -0.69
    "rocket"        -0.69
    "iencies"       -0.65
    "NAT"           -0.63
    " disabling"    -0.62
    " menstrual"    -0.62
    " Measure"      -0.61
    "berra"         -0.59
    "atible"        -0.59
    Positive Logits
    " said"         1.18
    " replied"      1.16
    " wrote"        1.15
    " exclaimed"    1.14
    " joked"        1.09
    " laughed"      1.04
    " remarked"     1.03
    " says"         1.03
    " tweeted"      1.02
    "said"          1.02
    Activation Density: 0.044%

    No Known Activations