INDEX
    Explanations

    references to societal norms regarding gender roles, particularly in relation to appearance and behavior

    Negative Logits
    Token          Logit
    "atan"         -0.16
    ".construct"   -0.15
    " Hlav"        -0.15
    " sir"         -0.14
    "iegel"        -0.14
    "611"          -0.14
    "225"          -0.14
    "洲"           -0.13
    "arat"         -0.13
    "511"          -0.13
    Positive Logits
    Token          Logit
    " superv"      0.19
    "rought"       0.16
    "ellij"        0.16
    "loth"         0.15
    "simp"         0.15
    " célib"       0.15
    " tokens"      0.15
    " Straw"       0.15
    " simples"     0.14
    "disposed"     0.14
    Act Density 0.049%

    No Known Activations