Explanations

The pattern is related to numbers associated with specific resources, and the top positive logits are words for "woman" or "women".

Looking at `MAX_ACTIVATING_TOKENS` and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`:
- `MAX_ACTIVATING_TOKENS`: `5`
- `TOKENS_AFTER_MAX_ACTIVATING_TOKEN`: `3`, `6`

These numbers appear in the `TOP_ACTIVATING_TEXTS` in contexts such as:
- "...1-800-656-HOPE" (a hotline number)
- "...1-800-799-SAFE" (a domestic violence hotline number)
- "3-0" (a score in a game)

The `TOP_POSITIVE_LOGITS` are words for "woman" or "women" in several languages.

This neuron seems to activate on sequences of numbers, especially in support hotlines or game scores, and its output is associated with the concept of "women".

Let's re-evaluate. `MAX_ACTIVATING_TOKENS` is literally just the number `5`, and `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` are `3` and `6`. These numbers often appear together in lists, especially lists of phone numbers such as 1-800-656-HOPE.
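The two fields the explanation leans on can be illustrated with a small sketch. It assumes, hypothetically, that each top-activating text is available as a list of tokens with one activation per token, that `MAX_ACTIVATING_TOKENS` collects the highest-activating token of each text, and that `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` collects the token right after it. The function name `summarize_top_texts` and the toy inputs are made up for illustration and are not the dashboard's actual pipeline.

```python
def summarize_top_texts(texts):
    """For each (tokens, activations) pair, record the highest-activating token
    and the token immediately after it, keeping first-seen order, no duplicates.

    Hypothetical reconstruction of how fields like MAX_ACTIVATING_TOKENS and
    TOKENS_AFTER_MAX_ACTIVATING_TOKEN could be built; not the real pipeline.
    """
    max_tokens, after_tokens = [], []
    for tokens, acts in texts:
        i = max(range(len(acts)), key=lambda j: acts[j])  # argmax position
        if tokens[i] not in max_tokens:
            max_tokens.append(tokens[i])
        # The peak may be the last token of the text, with nothing after it.
        if i + 1 < len(tokens) and tokens[i + 1] not in after_tokens:
            after_tokens.append(tokens[i + 1])
    return max_tokens, after_tokens


# Made-up toy inputs, not real dashboard data.
texts = [
    (["x", "5", "3"], [0.0, 1.0, 0.2]),
    (["5", "6", "y"], [0.8, 0.1, 0.0]),
]
print(summarize_top_texts(texts))  # (['5'], ['3', '6'])
```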

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

"Setpoint"     0.44
" recesses"    0.44
"ABS"          0.41
"search"       0.40
"arr"          0.38
"BMI"          0.38
" Phil"        0.38
" wisdom"      0.38
"abs"          0.38
"gap"          0.37

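Positive Logits

" ಮಹಿ"          0.60   (Kannada; start of ಮಹಿಳೆ, "woman")
" женщины"      0.59   (Russian, "women")
" महिलाओं"       0.59   (Hindi, "women")
" महिलाएं"        0.57   (Hindi, "women")
" vrouwen"      0.55   (Dutch, "women")
" женщин"       0.54   (Russian, "women")
" عورت"         0.54   (Urdu, "woman")
" womens"       0.54
" womans"       0.54
" महिला"         0.53   (Hindi, "woman")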
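Dashboards of this kind commonly derive the positive and negative logit lists by projecting the feature's output direction through the model's unembedding matrix and taking the largest and smallest entries. The sketch below assumes exactly that; the names `decoder_dir`, `W_U` and `vocab`, and the toy numbers, are hypothetical, and the real pipeline may apply extra normalization before ranking.

```python
import numpy as np


def top_logits(decoder_dir, W_U, vocab, k=10):
    """Rank vocabulary tokens by how strongly a feature direction pushes their logits.

    decoder_dir: (d_model,) output direction of the feature (hypothetical name).
    W_U: (d_model, d_vocab) unembedding matrix.
    vocab: list of d_vocab token strings.
    Returns (top_positive, top_negative), each a list of (token, score).
    """
    scores = np.asarray(decoder_dir) @ np.asarray(W_U)  # shape (d_vocab,)
    order = np.argsort(scores)  # ascending: most negative first
    top_neg = [(vocab[i], float(scores[i])) for i in order[:k]]
    top_pos = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return top_pos, top_neg


# Toy example with made-up numbers: d_model = 2, vocabulary of 3 tokens.
vocab = [" women", " score", "gap"]
W_U = np.array([[0.6, 0.1, -0.4],
                [0.0, 0.5, 0.2]])
pos, neg = top_logits(np.array([1.0, 0.0]), W_U, vocab, k=1)
print(pos, neg)  # [(' women', 0.6)] [('gap', -0.4)]
```

Leading spaces are kept in the token strings because, as in the lists above, " women" and "women" are usually different vocabulary entries.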

Activation Density: 0.009%
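For scale, if the density is measured over every token of the configured prompts (238,145 prompts of 512 tokens each), which is an assumption since the page does not say how it is sampled, 0.009% works out to roughly eleven thousand activating tokens:

```python
# Assumption: density = activating tokens / all tokens in the configured prompts.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.009  # as displayed on the dashboard (rounded)

total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT  # 121,930,240 tokens
active_tokens = total_tokens * DENSITY_PERCENT / 100

print(f"{total_tokens:,} tokens, about {round(active_tokens):,} active")
# 121,930,240 tokens, about 10,974 active
```

Because the displayed density is rounded, treat the result as an order-of-magnitude figure rather than an exact count.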