Explanations

good intentions and character

The neuron fires on words that express positive intentions or goodwill (e.g. “well-meaning,” “helpful,” “good”), that is, words that signal benevolent motivations.

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

与此同时    -1.59
嗄    -1.44
 OGSÅ    -1.43
除此    -1.37
我和    -1.36
 determining    -1.35
ｋ    -1.34
继续阅读    -1.34
最后由    -1.34
ಞ    -1.32

Positive Logits

↵↵    2.13
    1.78
',    1.55
,”    1.50
    1.46
",    1.41
這個    1.40
—    1.38
,.    1.38
もいい    1.34

Activations Density 0.013%
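
To put the density figure in context, the short sketch below converts it into an approximate token count using the prompt configuration listed above. It assumes that density means the fraction of all dashboard tokens on which the neuron has a nonzero activation; the page does not state this definition, so treat it as an assumption. The function name `activating_tokens` is hypothetical and used only for illustration.

```python
# Figures copied from the Configuration and Activations Density lines.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.013  # displayed value, presumably rounded

# Assumption (not stated on the page):
# density = activating tokens / total tokens across the dashboard prompts.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT


def activating_tokens(density_percent: float, total: int) -> float:
    """Convert a density given in percent into an expected token count."""
    return density_percent / 100.0 * total


estimate = activating_tokens(DENSITY_PERCENT, total_tokens)

# If 0.013% is rounded to three decimals, the true density lies
# roughly between 0.0125% and 0.0135%.
low = activating_tokens(0.0125, total_tokens)
high = activating_tokens(0.0135, total_tokens)

print(f"total tokens: {total_tokens:,}")
print(f"estimated activating tokens: {estimate:.0f} "
      f"(range {low:.0f} to {high:.0f})")
```

Under these assumptions the dashboard covers 3,145,728 tokens, and the neuron would be active on roughly 409 of them, which is consistent with a sparse, narrowly tuned unit.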