INDEX
    Explanations

    concepts related to morality and respect

    Head Attr Weights
    0:0.02
    1:0.02
    2:0.07
    3:0.05
    4:0.05
    5:0.03
    6:0.05
    7:0.46
    8:0.04
    9:0.04
    10:0.08
    11:0.05
    Negative Logits
    ################
    -1.71
    875
    -1.62
     crashes
    -1.61
     Torn
    -1.61
    342
    -1.60
     heartbreaking
    -1.54
    eps
    -1.52
    ł
    -1.51
    anz
    -1.51
     [&
    -1.47
    Positive Logits
     sophistication
    2.48
     professionalism
    2.28
    antry
    1.93
     anonymity
    1.92
     advancement
    1.90
    ACY
    1.84
     superiority
    1.84
    awareness
    1.82
    ainment
    1.80
    Appearance
    1.79
    Act Density 0.001%

    No Known Activations