    Explanations

    ethical and harmless principles

    Negative Logits
    Token                       Logit
    "大胆" (Chinese: "bold")     0.46
    " hydrophobic"              0.46
    " runtime"                  0.43
    "ู่"                         0.42
    " nightlife"                0.40
    "bold"                      0.40
    "Runtime"                   0.40
    " bold"                     0.39
    (blank)                     0.39
    "Bold"                      0.39
    Positive Logits
    Token                       Logit
    " moral"                    0.99
    " altru"                    0.95
    " wholesome"                0.93
    "Moral"                     0.90
    "善良" (Chinese: "kind")     0.89
    " Moral"                    0.89
    " virtuous"                 0.89
    "moral"                     0.85
    " morals"                   0.83
    " ethical"                  0.82
    Activation Density: 0.392%

    No Known Activations