    Explanations

    refusal of harmful content

    Negative Logits
    Token       Value
    (blank)     0.67
    efois       0.66
    ashire      0.66
    (blank)     0.66
    plicial     0.62
    ibalsan     0.61
    foils       0.61
    نمية        0.60
    ariski      0.60
    seid        0.60
    Positive Logits
    Token       Value
    true        0.79
    beautiful   0.73
    this        0.72
    what        0.72
    (blank)     0.71
    (blank)     0.69
    picture     0.68
    (blank)     0.66
    /           0.65
    everyday    0.61
    Activation Density: 0.048%

    No Known Activations