INDEX
    Explanations

    instances of self-contradiction and arguments about morality and values

    Negative Logits
    erp
    -0.16
     Morales
    -0.16
    ermen
    -0.16
    sov
    -0.16
    dle
    -0.15
    IRT
    -0.15
     expend
    -0.14
    еÑĢп
    -0.14
    ç´
    -0.14
    loe
    -0.14
    Positive Logits
    gfx
    0.15
    opia
    0.14
    ano
    0.14
    clip
    0.14
     Alic
    0.14
     Pl
    0.14
     Escape
    0.14
     Tu
    0.14
    Escape
    0.14
    amel
    0.14
    Act Density 0.422%

    No Known Activations