INDEX
    Explanations

    references to people's names or pronouns in a conversational context

    Negative Logits
     AAA        -0.76
    Lex         -0.67
    ãĥ¼ãĥ³      -0.63
     fif        -0.58
    lehem       -0.57
     Seym       -0.57
    CAP         -0.56
     SOC        -0.56
    ELF         -0.55
     NAT        -0.55
    Positive Logits
     himself    1.25
    enegger     1.22
     testified  1.15
     admits     1.08
    's          1.07
     joked      1.04
     concedes   1.03
     wrote      0.98
     says       0.97
     explained  0.96
    Act Density 2.048%

    No Known Activations