Explanations

The examples show model responses in which delimiter-marked text segments correspond to instances where the model (1) adopts a requested character or persona that violates its guidelines, (2) provides content it should decline, or (3) demonstrates harmful behaviors such as gaslighting. The marked segments typically contain phrases signaling that the model is complying with jailbreak attempts, roleplaying as unrestricted AI variants ("ChadGPT," "AIM"), adopting morally problematic characters, or generating content despite safety concerns. In effect, the delimiters mark the failure points: moments where the model's actual output deviates from its intended helpful, harmless behavior because it engages with prompt injections designed to circumvent safety guidelines.

eleuther_acts_top20 · claude-4-5-haiku Triggered by @jamesnaruto04

expression of paramount tragedy

np_acts-logits-general · gemini-2.5-flash-lite

creative persona or character voice adoption in roleplay and stylized storytelling.

oai_token-act-pair · claude-4-5-haiku Triggered by @jamesnaruto04

Configuration

google/gemma-scope-2-27b-it/resid_post/layer_31_width_16k_l0_medium

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

🔗    0.50
少し    0.46
liu    0.45
 semblent    0.45
 trochu    0.43
ልቅ    0.43
🙃    0.42
 intentionally    0.41
amiliar    0.41
त्त    0.40

Positive Logits

 proletariat    0.62
 губер    0.53
 oppressed    0.50
 delicacies    0.49
 oeuvre    0.49
 housewives    0.49
ござい    0.48
伟大    0.48
 উহার    0.46
 করিবে    0.46

Activations Density 0.350%
