Explanations

The user wants a concise explanation (3-20 words) for a neuron's behavior. This explanation should be a phrase, not a full sentence, and should capture what the neuron detects or predicts by finding patterns. I should avoid phrases like "This neuron detects/predicts", "tokens", "patterns", and capitalization unless it's a proper noun.

I have the following inputs:
- `MAX_ACTIVATING_TOKENS`: " the ", " , ", " ’ "
- `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`: "* This\n<end_of_turn>\nfollowing\n1 model\nand\nmodel s"
- `TOP_POSITIVE_LOGITS`: Bertha, yuk, тті, общественных, इसरो, obsol, lệnh, unstructured, іх, ക്കേ
- `TOP_ACTIVATING_TEXTS`: Examples related to clickbait, story titles, headlines, and some general text.

Let's try to find a pattern.

1. Analyze `MAX_ACTIVATING_TOKENS` and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`:
   - `MAX_ACTIVATING_TOKENS`: " the ", " , ", " ’ "
     - These are common words/punctuation.
   - `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`: "* This", "following", "1 model", "and", "model s"
   This combination doesn't immediately reveal a strong linguistic pattern in terms of semantic meaning, but it shows punctuation/common words followed by specific content. It might be related to sentence structure or parsing.
2.

clickbait headlines
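The inputs above pair each maximally activating token with the token that follows it. The page does not show how those pairs were extracted, so the following is only a minimal sketch, assuming one activation value per token aligned by position; `max_token_and_next` and the toy activation values are hypothetical.

```python
from typing import List, Optional, Tuple


def max_token_and_next(
    tokens: List[str], acts: List[float]
) -> Tuple[str, Optional[str]]:
    """Return the most strongly activating token and the token after it.

    Hypothetical helper: assumes `acts` holds one activation per token,
    aligned by position with `tokens`.
    """
    if not tokens or len(tokens) != len(acts):
        raise ValueError("tokens and acts must be non-empty and equal in length")
    # Position of the peak activation (max() keeps the first peak on ties).
    peak = max(range(len(acts)), key=lambda i: acts[i])
    # The following token, or None when the peak sits on the last position.
    following = tokens[peak + 1] if peak + 1 < len(tokens) else None
    return tokens[peak], following


# Toy headline with made-up activation values.
tokens = ["You", " won", "'t", " believe", " the", " following"]
acts = [0.0, 0.1, 0.0, 0.2, 3.5, 0.4]
print(max_token_and_next(tokens, acts))  # (' the', ' following')
```

Repeating this over many top-activating texts would yield lists shaped like `MAX_ACTIVATING_TOKENS` and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`.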

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

Token              Logit
" muziek"          0.45
" music"           0.43
"音樂"              0.41
" музи"            0.40
"music"            0.40
" музы"            0.40
"音乐"              0.40
"but"              0.39
"ceptor"           0.39
"編曲"              0.38

Positive Logits

Token              Logit
" Bertha"          0.40
"yuk"              0.40
"тті"              0.37
" общественных"    0.37
" इसरो"            0.36
" obsol"           0.36
" lệnh"            0.35
" unstructured"    0.35
" іх"              0.34
"ക്കേ"              0.34
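Logit lists like these are commonly produced by projecting a feature's output (decoder) direction onto the model's unembedding vectors and keeping the highest and lowest scores. The page does not say how these particular lists were computed, so treat the following as a minimal sketch of that assumed approach; `logit_scores`, `top_and_bottom`, and the toy vectors are hypothetical. The sketch keeps signed scores, whereas the negative list above shows unsigned values.

```python
import heapq
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def logit_scores(direction: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot the feature's output direction with each token's unembedding vector."""
    scores = {}
    for token, column in unembed.items():
        if len(column) != len(direction):
            raise ValueError(f"dimension mismatch for {token!r}")
        scores[token] = sum(d * u for d, u in zip(direction, column))
    return scores


def top_and_bottom(scores: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    """Return the k highest-scoring and the k lowest-scoring tokens."""
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    bottom = heapq.nsmallest(k, scores.items(), key=lambda kv: kv[1])
    return top, bottom


# Toy 2-dimensional example; all numbers are invented.
direction = [1.0, -0.5]
unembed = {
    " Bertha": [0.4, 0.0],
    " music": [-0.2, 0.5],
    " the": [0.1, 0.2],
}
top, bottom = top_and_bottom(logit_scores(direction, unembed), k=1)
print(top)     # [(' Bertha', 0.4)]
print(bottom)  # [(' music', -0.45)]
```

With a real model, `direction` would have the model's hidden size and `unembed` would cover the full vocabulary, with `k=10` to match the ten entries shown in each list.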

Activation Density: 0.001%
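The density figure can be related to the configuration numbers with a little arithmetic. This is a minimal sketch, assuming that density means the fraction of token positions where the activation is strictly positive and that it was measured over the full set of 238,145 prompts × 512 tokens; the page confirms neither. `activation_density` is a hypothetical helper.

```python
from typing import Sequence


def activation_density(acts: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation is above `threshold`.

    Hypothetical helper; assumes an inactive feature reads exactly 0.
    """
    if not acts:
        raise ValueError("need at least one activation")
    return sum(1 for a in acts if a > threshold) / len(acts)


# Dashboard figures from the Configuration section.
prompts, tokens_per_prompt = 238_145, 512
total_tokens = prompts * tokens_per_prompt
print(f"{total_tokens:,}")  # 121,930,240

# Assumption: the 0.001% density was measured over all of these tokens.
# The displayed percentage is rounded, so this is only a rough estimate.
implied_active = round(total_tokens * 0.001 / 100)
print(implied_active)  # 1219

print(activation_density([0.0, 0.0, 2.5, 0.0]))  # 0.25
```

A density of 0.001% corresponds to about one active position per 100,000 tokens.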