INDEX
    Explanations

    references to cultural and social structures or norms

    Negative Logits
    ussy        -0.15
    erosis      -0.14
    -the        -0.14
    ITS         -0.14
    oz          -0.14
    ieten       -0.14
     TypeInfo   -0.14
     Gott       -0.14
    ))-         -0.14
    peg         -0.13
    Positive Logits
    —to         0.20
    —for        0.19
    —in         0.17
    —that       0.17
    --          0.16
    §           0.15
     toy        0.15
    -than       0.15
    ,on         0.15
    —as         0.15
    Activation Density: 0.038%

    No Known Activations