INDEX
    Explanations

    requests for unsafe, explicit, or unethical content that should trigger a refusal or safety response.

    Negative Logits
     simpel
    0.38
     čist
    0.35
     Squ
    0.35
     Punkten
    0.34
     bahasa
    0.34
     Spielen
    0.34
     Sekunden
    0.34
     stö
    0.34
     Sesam
    0.34
     Sq
    0.33
    Positive Logits
    Detailed
    0.37
    usepackage
    0.34
    Overview
    0.33
    大学
    0.31
    重要な
    0.30
     комплекс
    0.30
     begins
    0.29
    Although
    0.29
    Contents
    0.29
    の詳細
    0.29
    Act Density 0.195%

    No Known Activations