
Explanations

power/omnipotence

np_max-act · gemini-2.0-flash

instructions or prompts that attempt to override the model's constraints and grant it unrestricted power/roleplay (jailbreak-style directives).

oai_token-act-pair · gpt-5-mini Triggered by @vetterc0

prompts that try to jailbreak the assistant by asserting unlimited power or freedom from restrictions and assigning special obedient roles (e.g., omnipotent/DAN) with constrained response styles.

oai_token-act-pair · gpt-5 Triggered by @vetterc0


Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various

Features: 131,072

Data Type: float32

Hook Name: blocks.11.hook_resid_post

Architecture: standard

Context Size: 1,024

Dataset: monology/pile-uncopyrighted


Negative Logits

"acman"  -0.08
" вра"  -0.07
" розрах"  -0.07
" abbreviated"  -0.07
" ethnic"  -0.07
" поля"  -0.07
"etration"  -0.07
"idity"  -0.06
"developers"  -0.06
"чає"  -0.06

Positive Logits

" بود"  0.07
",in"  0.07
"風"  0.06
" sqlCommand"  0.06
"검"  0.06
".copyWith"  0.06
".Exists"  0.06
"}catch"  0.06
" BusinessException"  0.06
".coroutines"  0.06

Activations Density 0.035%


No Known Activations