    Explanations

    references to societal norms and perceptions

    Negative Logits
    Token          Logit
    "amt"          -0.15
    "är"           -0.15
    "aram"         -0.14
    " Trad"        -0.14
    "erties"       -0.14
    "ilim"         -0.13
    "orama"        -0.13
    "onald"        -0.13
    "ェ"            -0.13
    "ollen"        -0.13

    Positive Logits
    Token          Logit
    " routine"     0.20
    " normal"      0.18
    " background"  0.18
    "NORMAL"       0.17
    "normal"       0.17
    " NORMAL"      0.17
    "-normal"      0.17
    " Routine"     0.16
    " normalize"   0.16
    "routine"      0.16
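    The two lists rank vocabulary tokens by how strongly this feature pushes their next-token logits down (negative) or up (positive). A minimal sketch of that ranking follows, assuming each value is the feature's decoder direction projected directly onto the model's unembedding matrix, skipping the final LayerNorm and any later layers. `top_logit_effects`, `decoder_dir`, and `unembed` are hypothetical names, and the toy numbers are illustrative only, not taken from this feature.

```python
from typing import List, Sequence, Tuple


def top_logit_effects(
    decoder_dir: Sequence[float],        # hypothetical: feature decoder vector, length d_model
    unembed: Sequence[Sequence[float]],  # hypothetical: unembedding matrix, shape (d_model, d_vocab)
    vocab: Sequence[str],                # token strings, length d_vocab
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and k most positive direct logit effects.

    Assumption: the effect on token j is the dot product of the decoder
    direction with column j of the unembedding (direct path only).
    """
    d_vocab = len(vocab)
    # Effect on token j = sum_i decoder_dir[i] * unembed[i][j]
    effects = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(d_vocab)
    ]
    # Token indices ordered from most negative to most positive effect
    order = sorted(range(d_vocab), key=lambda j: effects[j])
    negative = [(vocab[j], effects[j]) for j in order[:k]]
    # Most positive first
    positive = [(vocab[j], effects[j]) for j in reversed(order[-k:])]
    return negative, positive


# Toy example: d_model = 2, four hypothetical tokens
neg, pos = top_logit_effects(
    decoder_dir=[1.0, 0.0],
    unembed=[[0.2, -0.1, 0.05, -0.15],
             [0.0, 0.3, 0.0, 0.0]],
    vocab=["a", "b", "c", "d"],
    k=2,
)
print(neg)  # [('d', -0.15), ('b', -0.1)]
print(pos)  # [('a', 0.2), ('c', 0.05)]
```

    Only the direct path to the logits is captured in this sketch; any effect the feature has through later layers is ignored.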
    Activation density: 0.205%
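    Assuming activation density is the share of sampled token positions on which the feature's activation is strictly positive, 0.205% corresponds to roughly 2 firings per 1,000 tokens. A minimal sketch of that calculation; `activation_density` and its `threshold` parameter are hypothetical, and the sample activations are made up for illustration.

```python
from typing import Iterable


def activation_density(acts: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of token positions where the activation exceeds threshold.

    Assumption: "fires" means activation > threshold (default 0.0).
    """
    acts = list(acts)
    if not acts:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in acts if a > threshold)
    return 100.0 * fired / len(acts)


# Hypothetical sample: the feature fires on 4 of 2,000 tokens
acts = [0.0] * 2000
acts[:4] = [1.3, 0.7, 2.1, 0.4]
print(activation_density(acts))  # 0.2
```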

    No Known Activations