Explanations

AI and human comparison

np_acts-logits-general · gemini-2.5-flash-lite

statements that label or explain a post as automated/bot-generated, often with meta-descriptive phrasing and transitional cues like sentence-initial connectors after a line break.

oai_token-act-pair · gpt-5 Triggered by @vetterc0

Top Features by Cosine Similarity

Comparing With LLAMA3.3-70B-IT @ 50-resid-post-gf

Configuration

Goodfire/Llama-3.3-70B-Instruct-SAE-l50/Llama-3.3-70B-Instruct-SAE-l50.pt

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

"ぅ"  -0.10
"Editable"  -0.10
"nels"  -0.09
" ﾘ"  -0.09
"ego"  -0.09
"Rug"  -0.09
"ivan"  -0.09
" Portable"  -0.08
" Wireless"  -0.08
" Bodies"  -0.08
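The Llama 3 tokenizer is a byte-level BPE, and Hugging Face byte-level tokenizers store each byte of a multi-byte UTF-8 character as a stand-in character from the GPT-2 byte-to-unicode table (a leading space, for example, becomes Ġ). Printed raw, such tokens look garbled: æľº is really 机. The sketch below rebuilds that table and inverts it. It assumes the dashboard shows tokens in this stand-in form, and `decode_token` is a hypothetical helper name, not part of any library.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2 byte-to-unicode table used by byte-level BPE vocabularies."""
    # Printable bytes map to the character with the same code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every other byte (control chars, space, 0x7F-0xA0, 0xAD) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a raw byte-level BPE token string into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


for tok in ["ãģħ", "Ġï¾ĺ", "æľº", "æ©Ł"]:
    print(repr(tok), "->", repr(decode_token(tok)))
```

Applied to the non-ASCII tokens in the two logit lists, this yields ぅ (small hiragana u), ﾘ (halfwidth katakana ri), 机 and 機; the last two are the simplified and traditional Chinese characters for "machine".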

Positive Logits

" machines"  0.48
" machine"  0.46
" Machine"  0.34
" Machines"  0.33
"machine"  0.33
"AI"  0.31
"Machine"  0.30
"-machine"  0.30
"机"  0.29
"機"  0.28
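Logit lists like these are commonly produced by projecting the feature's decoder direction onto the model's unembedding matrix and reading off the highest and lowest entries. The sketch below assumes that method (this page does not state how its lists were computed) and uses small random matrices in place of the real Llama 3.3 70B weights; the function name `top_logit_effects`, the toy sizes, and the unit-norm decoder row are illustrative assumptions. Real tooling may also fold in the model's final normalization, which is omitted here.

```python
import numpy as np


def top_logit_effects(w_dec_row: np.ndarray, w_u: np.ndarray, k: int = 10):
    """Return (top_ids, top_vals, bottom_ids, bottom_vals) for one SAE feature.

    w_dec_row: the feature's decoder direction, shape (d_model,)
    w_u:       unembedding matrix, shape (d_model, vocab_size)
    """
    # Direct effect of adding the feature direction on every token's logit.
    logits = w_dec_row @ w_u
    order = np.argsort(logits)  # ascending order of logit effect
    bottom_ids = order[:k]  # most negative effects
    top_ids = order[::-1][:k]  # most positive effects, largest first
    return top_ids, logits[top_ids], bottom_ids, logits[bottom_ids]


# Toy stand-ins for the real weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
w_u = rng.normal(size=(d_model, vocab_size))
w_dec_row = rng.normal(size=d_model)
w_dec_row /= np.linalg.norm(w_dec_row)  # assume a unit-norm decoder row

top_ids, top_vals, bottom_ids, bottom_vals = top_logit_effects(w_dec_row, w_u, k=10)
print("positive:", top_ids, np.round(top_vals, 2))
print("negative:", bottom_ids, np.round(bottom_vals, 2))
```

Mapping `top_ids` and `bottom_ids` back to strings with the model's tokenizer would give lists in the same shape as the ones above.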

Activation density: 0.594%
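Activation density is usually reported as the percentage of token positions on which the feature fires, taken here to mean an activation above zero. A minimal sketch, assuming the activations for the sampled tokens are available as a 2-D array; the array below is random toy data, not this feature's activations, and `activation_density` is a hypothetical helper. If the 0.594% figure were measured over the 10,000 × 128 dashboard sample (an assumption; the page does not say), it would correspond to roughly 1,280,000 × 0.00594 ≈ 7,600 active token positions.

```python
import numpy as np


def activation_density(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`.

    acts: feature activations, shape (n_prompts, n_tokens_per_prompt)
    """
    return 100.0 * float(np.mean(acts > threshold))


# Toy stand-in for the dashboard sample: 10,000 prompts x 128 tokens, mostly zeros.
rng = np.random.default_rng(0)
n_prompts, n_tokens = 10_000, 128
acts = np.zeros((n_prompts, n_tokens), dtype=np.float32)
n_active = 7_603  # chosen so the toy density lands near 0.594%
flat_idx = rng.choice(acts.size, size=n_active, replace=False)
acts.flat[flat_idx] = rng.uniform(0.1, 5.0, size=n_active)  # all strictly positive

print(f"{activation_density(acts):.3f}%")
```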

No Known Activations
