INDEX
    Explanations

    references to gender roles and societal expectations

    NEGATIVE LOGITS
    Token          Logit
    "amation"      -0.14
    "rap"          -0.14
    " directly"    -0.14
    "upertino"     -0.14
    " Exercise"    -0.13
    "empl"         -0.13
    "lean"         -0.13
    "Enemies"      -0.13
    " Thor"        -0.13
    "nesty"        -0.13

    POSITIVE LOGITS
    Token          Logit
    "ÙĪØ§"         0.17
    "amber"        0.16
    "ARR"          0.16
    "GGLE"         0.15
    "arr"          0.15
    " concept"     0.15
    "alker"        0.15
    "ILER"         0.14
    "çij"          0.14
    "ëŀ"           0.14
    Activation Density: 0.197%

    No Known Activations