INDEX

Explanations

becoming

The neuron flags statements of apology or disclaimer—phrases where someone denies wrongdoing or accepts accountability (e.g. “racism was never his intention,” “takes full accountability”).

Configuration

Prompts (Dashboard)

392,802 prompts, 256 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

(token not shown)    0.66
(token not shown)    0.64
!!    0.64
(token not shown)    0.63
(token not shown)    0.61
/).    0.60
?).    0.60
ic    0.59

Positive Logits

 “[    1.52
"[    1.49
 "...    1.46
 “…    1.44
 “‘    1.42
 “(    1.31
"'    1.28
"..    1.27
"(    1.26
“[    1.25

Activations Density 0.430%
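
As a rough illustration of what an activation density figure like the one above may represent, here is a minimal sketch. It assumes that density means the fraction of token positions on which the neuron's activation exceeds a threshold (zero here); the page itself does not state the definition. The function name `activation_density`, the `threshold` parameter, and the toy activations are hypothetical. The last two lines apply the same assumption to the prompt count, prompt length, and density listed on this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    `activations` is a list of prompts, each a list of per-token values.
    Assumed definition; the dashboard may compute density differently.
    """
    flat = [a for prompt in activations for a in prompt]
    if not flat:
        return 0.0
    active = sum(1 for a in flat if a > threshold)
    return active / len(flat)


# Toy data (hypothetical): 2 prompts of 4 tokens each, one active token.
toy = [[0.0, 0.0, 1.3, 0.0], [0.0, 0.0, 0.0, 0.0]]
density = activation_density(toy)

# Scale implied by the figures on this page, under the same assumption:
# 392,802 prompts of 256 tokens each, with a density of 0.430%.
total_tokens = 392_802 * 256
approx_active_tokens = round(total_tokens * 0.00430)
```

Under that assumption, a density of 0.430% over roughly 100 million tokens would correspond to a few hundred thousand active token positions, so the neuron would be firing on a small but non-trivial share of the dataset.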