    Explanations

    expressions of moral judgment or wrongdoing

    Negative Logits
    "antar"    -0.18
    "apa"      -0.17
    "лами"     -0.16
    "انو"      -0.16
    "uai"      -0.16
    "wine"     -0.16
    "anki"     -0.15
    "traits"   -0.15
    "coni"     -0.15
    ".qual"    -0.15
    Positive Logits
    "fully"     0.33
    "headed"    0.31
    "s"         0.26
    "/right"    0.26
    "wrong"     0.25
    "-headed"   0.25
    " wrong"    0.23
    " WRONG"    0.23
    "Wrong"     0.21
    " Wrong"    0.21
    Activation Density: 0.050%

    No Known Activations