INDEX
    Explanations

    prohibiting harmful responses

    Negative Logits
    Token               Value
    (token missing)     0.42
    " Ears"             0.40
    "Composer"          0.40
    (token missing)     0.40
    (token missing)     0.39
    "ostino"            0.39
    " composer"         0.37
    " حصول"             0.37
    "使用了"            0.37
    "🍚"                0.36
    Positive Logits
    Token               Value
    " dangerous"        0.69
    " hate"             0.65
    "Dangerous"         0.61
    " hazardous"        0.59
    " опас"             0.57
    " dangereux"        0.57
    " hates"            0.55
    " dangere"          0.54
    " gefähr"           0.54
    " Hazardous"        0.54
    Act Density 0.138%

    No Known Activations