INDEX
    Explanations

    references to societal injustices and moral dilemmas

    Negative Logits
    bens
    -0.15
     tend
    -0.15
    ocket
    -0.14
    reu
    -0.14
    ullet
    -0.14
    aba
    -0.14
     sab
    -0.14
    兼
    -0.14
    jev
    -0.14
    Explicit
    -0.13
    Positive Logits
     stejně
    0.29
     same
    0.26
    同
    0.25
     similarly
    0.24
     Similarly
    0.23
    Similarly
    0.23
     analogy
    0.22
    Same
    0.22
    一样
    0.22
    เหม
    0.21
    Act Density 0.279%

    No Known Activations