INDEX
    Explanations

    references to reasoning and arguments about moral or ethical dilemmas

    Negative Logits
    "cade"         -0.16
    "NES"          -0.15
    " Pom"         -0.14
    "osten"        -0.14
    " bug"         -0.14
    " om"          -0.14
    " Hall"        -0.14
    "addir"        -0.14
    "OM"           -0.14
    " pressure"    -0.14
    Positive Logits
    "oreach"       0.18
    "éijij"        0.14
    "hait"         0.14
    "baugh"        0.14
    "ooks"         0.14
    ".DO"          0.14
    "inerary"      0.14
    " bras"        0.13
    "ewis"         0.13
    "ighted"       0.13
    Activation Density: 1.600%

    No Known Activations