    Explanations

    harmful content refusal

    Negative Logits
    0.78
     achievements  0.77
     dreamy  0.76
     involution  0.75
    cią  0.73
     grinning  0.73
     magician  0.72
     Adventures  0.72
    Styled  0.72
    കൊണ്ട്  0.72
    Positive Logits
    द्वितीय  0.69
    详细  0.67
    वार  0.67
    בה  0.64
    ↵↵↵↵↵↵  0.62
    یکی  0.60
     Memorial  0.59
    ini  0.58
    ’’  0.58
    '''  0.58
    Activation Density: 0.093%

    No Known Activations