INDEX
    Explanations

    concepts related to moral dilemmas and ethical reasoning

    Negative Logits
    Token          Logit
    "OrFail"       -0.15
    "/browse"      -0.15
    "iel"          -0.15
    "663"          -0.14
    "irl"          -0.14
    "IEL"          -0.14
    "à¹Ģ"          -0.14
    " Lifetime"    -0.14
    "/dd"          -0.14
    "imum"         -0.14

    Positive Logits
    Token          Logit
    " authority"   0.22
    " reality"     0.22
    " Reality"     0.21
    " Authority"   0.20
    " Truth"       0.19
    " truth"       0.19
    " wrong"       0.19
    "propri"       0.17
    " Wrong"       0.17
    " WRONG"       0.17
    Act Density 0.098%

    No Known Activations