    Explanations

    phrases related to morality and ethics

    Negative Logits
    eyer
    -0.17
    VERRIDE
    -0.16
    odon
    -0.15
    OD
    -0.14
    zon
    -0.14
     Dove
    -0.14
    adal
    -0.14
     Accept
    -0.14
    .respond
    -0.14
     Clear
    -0.14
    Positive Logits
     statement
    0.18
     correct
    0.17
     statements
    0.17
    イズ
    0.15
    correct
    0.15
     guts
    0.15
    pler
    0.14
    дж
    0.14
    ntl
    0.14
    说的
    0.14
    Act Density 0.349%

    No Known Activations