    Explanations

    words related to moral concepts

    references to moral principles and values

    Negative Logits
    Token          Logit
    "xual"         -0.89
    " Roses"       -0.76
    "lers"         -0.76
    " Pavilion"    -0.76
    " Herz"        -0.76
    "minster"      -0.74
    " Reloaded"    -0.73
    "hips"         -0.72
    "-+"           -0.72
    "WER"          -0.71
    Positive Logits
    Token          Logit
    "istic"        1.10
    " hazard"      1.07
    " compass"     1.06
    " equival"     0.96
    " conscience"  0.96
    "istically"    0.93
    "ised"         0.91
    "izing"        0.91
    "ising"        0.90
    " dile"        0.88
    Activation Density: 0.020%

    No Known Activations