INDEX
    Explanations

    references to actions and their moral implications

    Negative Logits
    Token         Logit
    "ritz"        -0.18
    "aepernick"   -0.17
    " descon"     -0.15
    "zburg"       -0.14
    "DES"         -0.14
    "iens"        -0.14
    "asse"        -0.14
    "Ïĥε"         -0.14
    "incinn"      -0.14
    "stral"       -0.14
    Positive Logits
    Token         Logit
    "ĥĿ"          0.16
    " CY"         0.15
    "elli"        0.15
    "noop"        0.15
    " Bullet"     0.15
    "ellig"       0.15
    "809"         0.14
    "immel"       0.14
    "acked"       0.14
    ".bmp"        0.14
    Activation Density: 0.001%

    No Known Activations