    Explanations

    references to moral and ethical concepts

    Negative Logits
    Token         Logit
    "el"          -0.16
    "ary"         -0.16
    "antino"      -0.15
    "247"         -0.15
    "eded"        -0.15
    " Vladim"     -0.15
    "edin"        -0.15
    "erson"       -0.15
    "elan"        -0.15
    "umar"        -0.14

    Positive Logits
    Token         Logit
    "Mor"          0.23
    " Mor"         0.23
    " MOR"         0.21
    "izing"        0.21
    " fiber"       0.20
    " Fiber"       0.19
    "atorium"      0.18
    " mor"         0.18
    " Moral"       0.18
    " compass"     0.17

    Act Density 0.007%

    No Known Activations