INDEX

Explanations

extreme negative adjectives

The neuron fires on strong negative descriptors or modifiers (often broken into subword pieces like “prohib-…-itive,” “impos-…-ible,” “back-breaking,” “unbearable”) that signal extreme difficulty, cost, or severity.

New Auto-Interp

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Embeds

IFrame

Link

Not in Any Lists

Negative Logits

 vyrá

-1.56

notre

-1.56

bruh

-1.50

 domestiques

-1.50

 vios

-1.47

 doraemon

-1.47

 coreana

-1.45

assume

-1.45

 than

-1.45

 naruto

-1.44

POSITIVE LOGITS

sig

1.49

mistic

1.48

ка

1.48

ви

1.47

.”

1.45

har

1.45

</u>

1.43

ти

1.43

1.42

取り付け

1.41

Activations Density 0.026%