    Explanations

    references to female characters or pronouns

    Negative Logits
    " Monfieur"    -0.91
    " raiſ"        -0.87
    " Efq"         -0.87
    " Houſe"       -0.85
    " againſt"     -0.83
    " cauſe"       -0.81
    " uſe"         -0.79
    " itſelf"      -0.77
    " purpoſe"     -0.76
    "נטרנט"        -0.76

    Positive Logits
    " her"          1.79
    " his"          1.51
    " Her"          1.43
    " HER"          1.33
    "her"           1.29
    "Her"           1.28
    " she"          1.26
    " His"          1.17
    " HIS"          1.15
    "she"           1.09
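    The two lists above are the strongest positive and strongest negative per-token scores for this feature. Below is a minimal Python sketch of that ranking step. It assumes the scores are already available as a token-to-value mapping. The mapping copies a few entries from the lists, and `top_and_bottom` is a hypothetical helper, not the dashboard's actual code. How the raw scores are computed is not covered here.

```python
# Hypothetical sketch: split per-token logit scores for one feature into the
# k strongest positive and k strongest negative entries.
# Assumption: scores are precomputed; the values below are copied from the lists above.


def top_and_bottom(
    scores: dict[str, float], k: int
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return the k highest-scoring and the k lowest-scoring (token, score) pairs."""
    # Highest scores first for the positive list.
    positive = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    # Lowest (most negative) scores first for the negative list; ties keep insertion order.
    negative = sorted(scores.items(), key=lambda item: item[1])[:k]
    return positive, negative


# Leading spaces are part of the token strings and are kept deliberately.
scores = {
    " her": 1.79,
    " his": 1.51,
    " Her": 1.43,
    " she": 1.26,
    " Monfieur": -0.91,
    " raiſ": -0.87,
    " Efq": -0.87,
    " Houſe": -0.85,
}

pos, neg = top_and_bottom(scores, k=3)
print(pos)
print(neg)
```

    Sorting twice keeps the sketch simple. For a full vocabulary, a partial selection (for example `heapq.nlargest` and `heapq.nsmallest`) would avoid sorting every token.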
    Activation Density: 0.148%

    No Known Activations