Explanations

Controversial social/political discussions.
np_max-act · gemini-2.0-flash

The neuron activates on Spanish text segments (Spanish words and punctuation).
oai_token-act-pair · o4-mini · Triggered by @xinyanhu8

Finds tokens that indicate abusive or harmful content, especially hate speech, racist or discriminatory claims, and calls for violence or self-harm.
oai_token-act-pair · gpt-5-mini · Triggered by @vetterc0

Content involving hate speech or slurs, and sensitive discussions about race and discrimination, often alongside other safety-flagged topics such as violence or self-harm.
oai_token-act-pair · gpt-5 · Triggered by @vetterc0

Configuration

andyrdt/saes-llama-3.1-8b-instruct/resid_post_layer_11/trainer_1

Dataset (Dashboard): Various
Features: 131,072
Data Type: float32
Hook Name: blocks.11.hook_resid_post
Architecture: standard
Context Size: 1,024
Dataset: monology/pile-uncopyrighted

Negative Logits

"rnd"            -0.07
"ивать"          -0.07
"raud"           -0.07
"factor"         -0.07
" separ"         -0.06
" technically"   -0.06
"λογή"           -0.06
"ifica"          -0.06
"Email"          -0.06
" impeachment"   -0.06

Positive Logits

"-submit"    0.07
"REDIT"      0.06
"ař"         0.06
"朋"         0.06
" descri"    0.06
"Country"    0.06
"Az"         0.06
" Slot"      0.06
" slots"     0.06
" commem"    0.06
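
For readers unfamiliar with these lists, the sketch below shows one common way such per-token logit values are produced: the feature's decoder direction is projected through the model's unembedding matrix, and the tokens are ranked by the resulting score. The page does not state how its values were computed, so this is an assumption. The function names (`logit_effects`, `top_and_bottom`) and the toy two-dimensional weights are hypothetical and chosen only for illustration.

```python
def logit_effects(decoder_direction, unembed_rows, vocab):
    """Dot the feature's decoder direction with each token's unembedding row.

    Assumption: one unembedding row per vocabulary token, each with the
    same length as decoder_direction.
    """
    return {
        token: sum(d * w for d, w in zip(decoder_direction, row))
        for token, row in zip(vocab, unembed_rows)
    }


def top_and_bottom(scores, k):
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return ranked[:k], ranked[::-1][:k]


# Toy values, not taken from the real model.
vocab = [" slots", "Email", "the"]
unembed_rows = [[0.5, 0.2], [-0.25, 0.9], [0.0, 1.0]]
decoder_direction = [1.0, 0.0]

scores = logit_effects(decoder_direction, unembed_rows, vocab)
negative, positive = top_and_bottom(scores, k=1)
print("most negative:", negative)
print("most positive:", positive)
```

Under this reading, magnitudes as small as the ±0.07 listed here would mean the feature barely shifts any single output token, which fits the unrelated mix of tokens in both lists.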

Activations Density: 0.103%
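
As a rough illustration of the density figure, the sketch below assumes that "density" means the fraction of token positions on which the feature's activation is above zero. That definition is not stated on the page. The helper name `activation_density` and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: the feature fires on 1 of 1,000 token positions.
toy_activations = [0.0] * 999 + [2.5]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.100%
```

On this assumption, a density of 0.103% would correspond to the feature firing on roughly one token in a thousand.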

No Known Activations