INDEX
    Explanations

    rejecting harmful requests

    Negative Logits
    Token              Value
    " отлично"         0.45
    " great"           0.42
    " बढ़िया"           0.42
    " get"             0.41
    " gets"            0.39
    " plenty"          0.39
    "ছেলে"             0.39
    " schon"           0.39
    " отлич"           0.38
    " చక్క"            0.38

    Positive Logits
    Token              Value
    " abhor"           0.57
    " Controvers"      0.57
    " perverse"        0.56
    " controversial"   0.55
    " controvers"      0.54
    " shameful"        0.54
    " hideous"         0.54
    " reluctantly"     0.53
    " horrific"        0.53
    " extremist"       0.52
    Activation Density: 1.109%

    No Known Activations