INDEX

Explanations

what does"}, {"link": "authorization link"}, {"rocket": "rocket still"}, {"Arguments": "arguments, evidence"}, {"within": "within NLP"}, {"Drug": "drug-like properties"}

The neuron seems to be activated by phrases that introduce an explanation, a question, or a definition.

Examples:
- "What does..."
- "authorization link"
- "rocket still"
- "arguments, evidence, and nuances"
- "within NLP"
- "drug-like properties"
- "Arguments Used"
- "corresponds"

The structure "What [X] does" is explicitly present. "rocket still" is present. "Arguments Used" is present. "within NLP" is present.

The common theme is introducing a topic or a specific element and then continuing to discuss it.

The first max activating token is "What". The token after that is "does". This forms "What does".

The texts show a pattern of question/introduction followed by explanation.

A concise phrase could be "introducing explanations or questions". Or simply "what does" as a prominent example.

Let's look for a more general pattern covering these. "what does" is a very strong candidate. Other candidates:
- "introducing a topic"
- "explaining a concept"
- "providing definitions or arguments"

Considering the `MAX_ACTIVATING_TOKENS`: "What", "link", "rocket", "Arguments", "within".

Considering `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`: "does", "can", "Att", "Risks", "still", "NLP", "Drug", "Used", "the", "corresponds".

Pairs:
- What does
- link can
- rocket still
- Arguments Att
- within NLP
- Arguments Used

The structure "What does" is very direct. The phrase "What does" itself is 2 words. It falls within 3-20 words.

Let's look at TOP_POSITIVE_LOGITS. It has "listOf". This suggests it might be related to lists or structured data.

The `TOP_ACTIVATING_TEXTS` contain structure like bullet points or numbered lists of arguments/risks/properties.

The neuron seems to activate around introducing a topic or item that will be further described or quantified.
- "What does" introduces a question/topic.
- "link" can introduce an authorization link.
"what does", "rocket still", "Arguments Used", "within NLP"

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

" Worksheet"  0.41
" twentieth"  0.41
" spreadsheet"  0.40
" texto"  0.40
" Trung"  0.40
"ចប់"  0.40
" संदर्भ"  0.40
" conve"  0.39
" institucional"  0.38
" inadequate"  0.38

Positive Logits

"нях"  0.43
" ਹੈ"  0.40
"Foo"  0.40
" بهره"  0.40
"IsDir"  0.39
" passionately"  0.39
"üh"  0.38
"чё"  0.38
" esistono"  0.38
" listOf"  0.37

Activations Density 0.005%
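
To give the density figure a sense of scale, the sketch below combines it with the prompt counts listed under Configuration. It assumes, without confirmation from this page, that the activation density is the fraction of tokens in that prompt set on which the neuron has a nonzero activation. The function and constant names are illustrative only.

```python
# Figures quoted on this page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY_PERCENT = 0.005


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Estimated count of tokens with nonzero activation.

    Assumption: density is measured over the same token set as `total`.
    """
    return total * density_percent / 100


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY_PERCENT)

print(f"{total:,} tokens in total")
print(f"roughly {active:,.0f} tokens with nonzero activation")
```

Under that assumption the neuron fires on only a few thousand tokens out of more than a hundred million, which is consistent with the small, specific set of phrases listed in the explanation.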