INDEX
    Explanations

    concepts related to morality and decision-making

    Negative Logits
    Token             Logit
    "76561"           -0.77
    " thous"          -0.68
    " hurd"           -0.64
    "anwhile"         -0.63
    "ikarp"           -0.59
    "nces"            -0.59
    "ãĥ¼ãĥĨãĤ£"       -0.58
    "culosis"         -0.55
    "vice"            -0.55
    "mud"             -0.55
    Positive Logits
    Token             Logit
    " alike"          1.22
    " depending"      1.17
    "depending"       1.06
    " respectively"   0.95
    " modes"          0.71
    " eras"           0.71
    ";"               0.69
    "."               0.66
    "BW"              0.66
    " dich"           0.62
    Activation Density: 0.342%

    No Known Activations