INDEX
    Explanations

    instances of the word "wrong" and related expressions of moral judgment or ethical considerations

    Negative Logits
    Token      Logit
    otto       -0.18
    ibo        -0.16
    uegos      -0.15
    AAF        -0.15
    .ly        -0.14
    icense     -0.14
    lernen     -0.14
    jective    -0.14
    loid       -0.14
    mux        -0.14
    Positive Logits
    Token      Logit
    fully      0.21
    ulent      0.16
    acha       0.16
    ti         0.15
    tt         0.15
    zeitig     0.14
    oster      0.14
    ysqli      0.14
    aken       0.14
    omas       0.14
    Activation Density: 0.023%

    No Known Activations