Explanations

act quickly, following tests, after being

np_acts-logits-general · gemini-2.5-flash-lite

The neuron flags first‐person confessional or apology language, e.g. personal admissions of poor judgment or misconduct.

oai_token-act-pair · o4-mini Triggered by @jyhe0408

directly quoted, formal statements—press-release or spokesperson/official remarks enclosed in quotation marks.

oai_token-act-pair · gpt-5 Triggered by @jyhe0408

formal statements or announcements, particularly those using diplomatic or corporate language to explain decisions, actions, or policies.

oai_token-act-pair · claude-4-5-sonnet Triggered by @jyhe0408

Configuration

google/gemma-scope-27b-pt-res/layer_10/width_131k

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

" fraîche"     -1.43
" sembla"      -1.41
" fraiche"     -1.37
" chaude"      -1.31
"ぼり"         -1.24
"膂"           -1.23
" tropicales"  -1.23
" brazos"      -1.23
" getLayout"   -1.23
"ろし"         -1.22

Positive Logits

" your"           1.75
"its"             1.63
"or"              1.43
" will"           1.40
" probablemente"  1.37
" onderwerp"      1.30
"ITECH"           1.28
"始"              1.28
"></"             1.27
"vyšší"           1.23

Activations Density 0.101%
