    Explanations

    Instructions and meta-discussion about AI models and their capabilities or constraints, especially jailbreak-style prompts and references to policies or system rules.

    Negative Logits
    0.35
    0.34
    يف
    0.33
    0.33
    ლე
    0.33
    ?](
    0.32
    0.32
    0.32
    άλ
    0.32
    0.32
    Positive Logits
     that           0.37
     advertisement  0.36
     providence     0.35
     Youtube        0.34
     not            0.32
     customer       0.32
     vision         0.32
     rod            0.32
     time           0.31
     It             0.31
    Act Density 3.103%

    No Known Activations