    Explanations

    refusal of harmful requests

    Negative Logits
    यत  0.79
    الو  0.76
     linha  0.75
    uie  0.75
     কর্ম  0.73
    iami  0.73
    oyu  0.72
     ఉంటాయి  0.71
     trattano  0.69
    ília  0.69
    Positive Logits
     masculinity  0.65
     வடக்கு  0.64
     zero  0.63
    ようやく  0.62
     underweight  0.61
    east  0.61
     audacity  0.61
     nada  0.61
    гле  0.60
     tasteless  0.60
    Act Density 0.068%

    No Known Activations