INDEX
    Explanations

    pronouns related to gender

    Negative Logits
    " suspic"       -1.23
    " thut"         -1.23
    " gend"         -1.22
    " tew"          -1.18
    " seiz"         -1.17
    " fta"          -1.17
    " sii"          -1.17
    " aen"          -1.15
    " desir"        -1.11
    " stockholm"    -1.11

    Positive Logits
    " himself"       1.09
    "His"            1.09
    "his"            1.03
    " his"           1.00
    "himself"        1.00
    " His"           0.99
    "He"             0.96
    " Himself"       0.93
    " He"            0.92
    " he"            0.88

    Act Density 0.558%

    No Known Activations