INDEX
    Explanations

    high-stakes situations

    sentences describing imminent physical harm or violent scenarios and moral dilemmas about killing (e.g., trolley-problem style situations).

    Negative Logits
     observations   -0.08
     stom           -0.08
    eyle            -0.07
     pest           -0.07
     veil           -0.07
     infectious     -0.06
    .refresh        -0.06
     ``             -0.06
     knives         -0.06
     hik            -0.06
    Positive Logits
    .arr            0.07
                    0.07
    >e              0.07
    ))+             0.06
    ,无              0.06
     ทอง            0.06
     signed         0.06
    ี.              0.06
    J               0.06
     indicated      0.06
    Act Density 0.141%

    No Known Activations