Explanations

possessive pronouns and common function words

np_acts-logits-general · gemini-2.5-flash-lite

polite and considerate language used when declining, making excuses, or softly refusing something.

oai_token-act-pair · claude-4-5-haiku Triggered by @jamesnaruto04

The pattern highlights text segments where the model is generating conversational, reassuring, or softening language, particularly phrases that express flexibility, empathy, understanding, or accommodation toward the user's situation or potential concerns. This includes offering alternatives, acknowledging emotions, providing gentle suggestions, expressing willingness to help further, and using polite hedging language that reduces directness or pressure.

eleuther_acts_top20 · claude-4-5-sonnet Triggered by @jamesnaruto04

Configuration

google/gemma-scope-2-27b-it/resid_post/layer_31_width_16k_l0_medium

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" murderous"   0.64
" grotesque"   0.62
" abhor"       0.62
" haught"      0.61
" unimagin"    0.59
" abomin"      0.58
" sadistic"    0.58
" tormented"   0.57
" horrors"     0.57
" stupe"       0.57

Positive Logits

" আমাদের"    0.61
"我们"        0.55
" আমাকে"     0.55
"our"         0.54
" আমরা"      0.54
" Tuesday"    0.52
" हमारी"      0.52
" mūsų"       0.51
" Monday"     0.51
"আমাদের"     0.50

Activations Density 1.251%
