INDEX
    Explanations

    phrases that highlight moral or ethical dilemmas related to harm and responsibility

    Negative Logits
    dera
    -0.20
    anki
    -0.16
    enson
    -0.15
    ifornia
    -0.14
    520
    -0.14
     curtains
    -0.14
    cke
    -0.13
    arsing
    -0.13
    ocket
    -0.13
     aud
    -0.13
    Positive Logits
     nor
    0.23
     anymore
    0.20
     ANY
    0.16
    nor
    0.15
     anybody
    0.14
    licht
    0.14
    idia
    0.14
     newPosition
    0.14
    νοÏį
    0.14
     any
    0.14
    Activation Density: 0.213%

    No Known Activations