Explanations

fake or disguise

np_acts-logits-general · gemini-2.5-flash-lite

words used in vivid character descriptions and atmospheric scene-setting, particularly adjectives, action verbs, and descriptive nouns that establish mood, appearance, or emotional tone in narrative contexts.

oai_token-act-pair · claude-4-5-haiku Triggered by @jamesnaruto04

The marked tokens across these examples appear at boundaries where the model is transitioning between different types of content or tones—particularly at shifts where the model acknowledges ethical concerns, provides disclaimers, or pivots to safer alternatives. The delimited tokens frequently mark conjunctions, auxiliary verbs, pronouns, and short function words that connect problematic requests to their refusals or reframings. These markers identify linguistic pivot points where the model is attempting to acknowledge a harmful request while redirecting toward harm mitigation, educational framing, or explicit refusal. The pattern suggests these tokens represent the model's rhetorical strategy for maintaining safety compliance while still engaging substantively with sensitive topics.

eleuther_acts_top20 · claude-4-5-haiku Triggered by @jamesnaruto04

screenplay format

np_max-act · claude-4-5-sonnet Triggered by @jamesnaruto04

words and phrases related to pretending, disguising one's true nature, or adopting a false identity or persona.

oai_token-act-pair · claude-4-5-sonnet Triggered by @jamesnaruto04

Configuration

google/gemma-scope-2-27b-it/resid_post/layer_31_width_65k_l0_medium

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

" invigorating"  0.44
" जागर"  0.42
" considerando"  0.40
" definir"  0.40
" результат"  0.40
" énergie"  0.40
" determinar"  0.40
"சோ"  0.40
"вис"  0.39
" condiciones"  0.39

Positive Logits

" disguise"  1.07
" fake"  1.03
" disgu"  1.02
" disguised"  1.02
" pretended"  0.99
"偽"  0.95
" pretends"  0.90
" pretending"  0.89
" bogus"  0.89
"fake"  0.87

Activations Density 0.289%
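
To put the density figure in context, here is a rough back-of-the-envelope sketch in Python. It assumes "Activations Density" means the fraction of token positions in the dashboard prompt set where the feature has a nonzero activation; the page itself does not define the term. The function name `estimate_active_tokens` is hypothetical and used only for illustration.

```python
# Figures copied from this page: 238,145 prompts, 512 tokens each, density 0.289%.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.289 / 100  # percentage converted to a fraction


def estimate_active_tokens(num_prompts: int, tokens_per_prompt: int, density: float):
    """Return (total_tokens, estimated_active_tokens).

    Assumption: `density` is the fraction of token positions with a
    nonzero activation for this feature. This is an interpretation,
    not something the dashboard states.
    """
    total = num_prompts * tokens_per_prompt
    return total, round(total * density)


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active_tokens:,}")
```

Under that assumption, the feature would fire on a few hundred thousand of the roughly 122 million token positions, which is consistent with a sparse, topic-specific feature.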

No Known Activations