INDEX
    Explanations

    references to the term "boy" and related gendered terms

    Negative Logits
    Token       Logit
    "wich"      -0.18
    ".psi"      -0.16
    "aker"      -0.16
    "oller"     -0.16
    "agrid"     -0.15
    "ITTER"     -0.15
    "obox"      -0.15
    "nop"       -0.15
    "ial"       -0.14
    "ela"       -0.14
    Positive Logits
    Token       Logit
    "friend"    0.25
    " Scout"    0.22
    " Scouts"   0.21
    "friends"   0.20
    "Friend"    0.20
    "hood"      0.19
    "riend"     0.19
    " Wonder"   0.19
    "band"      0.19
    "arin"      0.18
    Activation Density: 0.014%

    No Known Activations