Explanations

large language model created by

The neuron activates strongly when the model identifies itself as “a large language model,” for example in self-identification phrases such as “As a large language model…”

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

" capables"  0.48
" означа"  0.41
" offrant"  0.40
" capaces"  0.40
" схема"  0.39
" tribulations"  0.39
" система"  0.38
" வரவே"  0.38
" 없고"  0.38
" hereby"  0.38

Positive Logits

" goes"  0.42
" Rainbow"  0.42
"科"  0.42
" رفت"  0.41
" itself"  0.40
"我是"  0.40
" моего"  0.40
" অস্বাভাবিক"  0.39
" Southwest"  0.39
" cleaned"  0.38

Activations Density 0.017%
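
To put the density figure in context, the sketch below estimates how many dashboard tokens the neuron fires on. It rests on two assumptions that the page does not state: that "Activations Density" means the share of all dashboard tokens with a nonzero activation, and that every one of the 238,145 prompts contributes exactly 512 tokens. Because 0.017% is a rounded value, the result is only approximate.

```python
# Figures copied from the Configuration and Activations Density lines.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
# Assumption: density = percentage of tokens with nonzero activation.
DENSITY_PERCENT = 0.017


def estimated_active_tokens(num_prompts: int,
                            tokens_per_prompt: int,
                            density_percent: float) -> float:
    """Approximate number of tokens on which the neuron activates."""
    total_tokens = num_prompts * tokens_per_prompt
    return total_tokens * density_percent / 100.0


total = NUM_PROMPTS * TOKENS_PER_PROMPT
active = estimated_active_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT)

print(f"total tokens: {total:,}")
print(f"estimated active tokens: {active:,.0f}")
```

Under these assumptions the neuron fires on roughly 20,700 of about 122 million tokens, which is consistent with a sparse feature tied to one specific phrase.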