INDEX
    Explanations

    concepts related to morality and ethical behavior

    Negative Logits
    Token          Logit
    "zen"          -0.16
    "ej"           -0.14
    " Barr"        -0.14
    " ner"         -0.13
    "usi"          -0.13
    " Platt"       -0.13
    " complied"    -0.13
    "pla"          -0.13
    "476"          -0.13
    " éļ"          -0.13
    Positive Logits
    Token          Logit
    "anke"         0.15
    "isches"       0.14
    "enty"         0.14
    "ARRANT"       0.14
    "airo"         0.14
    "ána"          0.13
    "æĪIJ人"       0.13
    "å¯Ħ"          0.13
    "лаз"          0.13
    "entral"       0.13
    Activation Density: 0.750%

    No Known Activations