INDEX
    Explanations

    pronouns that refer to "he" or "she" at high activation levels

    references to gender pronouns, specifically "he" and "she."

    Negative Logits
        Token         Logit
        "Joy"         -0.85
        " Vive"       -0.71
        " Dam"        -0.70
        "rar"         -0.69
        " Sov"        -0.66
        " Vil"        -0.65
        "1080"        -0.65
        " Cruiser"    -0.65
        "Js"          -0.64
        " Squid"      -0.63
    Positive Logits
        Token         Logit
        "itage"        0.75
        "self"         0.73
        " own"         0.71
        "ゴン"          0.71
        " initials"    0.71
        "itant"        0.67
        "gdala"        0.66
        "agher"        0.64
        "owe"          0.64
        " fate"        0.63
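The logit tables list the tokens whose output logits this feature most suppresses (negative) and most promotes (positive). Dashboards of this kind commonly get them by projecting the feature's decoder direction through the model's unembedding matrix. The sketch below assumes exactly that and ignores any final layer norm. It uses a hypothetical helper `logit_effects` with toy weights and placeholder token names, not this feature's real parameters.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_dir: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Hypothetical helper: score each vocabulary token by the dot product of the
    feature's decoder direction with that token's unembedding column, then return
    the k lowest-scoring (negative) and k highest-scoring (positive) tokens."""
    d_model = len(decoder_dir)
    # unembed is stored as d_model rows, each holding one entry per vocabulary token
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    # Sort ascending by score so the most suppressed tokens come first
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = list(reversed(ranked))[:k]
    return negative, positive


# Toy example: a 2-dimensional residual stream and a 4-token vocabulary
decoder_dir = [1.0, -0.5]
unembed = [
    [0.2, -0.4, 0.9, 0.0],
    [0.6, 0.1, -0.2, 0.8],
]
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
negative, positive = logit_effects(decoder_dir, unembed, vocab, k=2)
print(negative)
print(positive)
```

On a real model the ranking would run over the full vocabulary with k = 10 to match the ten rows per table above. Whether the dashboard applies any extra scaling to the scores is not stated here, so the sketch leaves them raw.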
    Activation density: 0.058%
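Activation density is read here as the share of sampled tokens on which the feature fires (activation above zero). Under that assumption, 0.058% corresponds to roughly 58 firing tokens per 100,000. Here is a minimal sketch with a hypothetical `activation_density` helper and a synthetic activation list:

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: fraction of tokens whose activation exceeds `threshold`."""
    if len(activations) == 0:
        raise ValueError("activation_density needs at least one token")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Synthetic data: 58 firing tokens out of 100,000 sampled tokens
acts = [0.0] * 99_942 + [1.3] * 58
density = activation_density(acts)
print(f"{density * 100:.3f}%")  # prints 0.058%
```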

    No known activations.