INDEX
    Explanations

    model refusal for unsafe content

    Negative Logits
    "appareil"   0.38
    "iatric"     0.38
    "خیص"        0.38
    " ابنائي"    0.37
    " мои"       0.37
    " benim"     0.37
    " моих"      0.37
    " meiner"    0.36
    " dennoch"   0.36
    "اعه"        0.36
    Positive Logits
    " sorry"      0.60
    "Sorry"       0.54
    "sorry"       0.50
    " Sorry"      0.49
    "Click"       0.46
    "Disclaimer"  0.45
    "Title"       0.43
    "Content"     0.40
    "Featuring"   0.40
    "Please"      0.38
    Activation Density: 0.010%

    No Known Activations