Explanations

uncommon words

Configuration

Prompts (Dashboard)

16,384 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token (quoted)    Logit
"iol"             -0.30
"åıĹå®³"          -0.30
" breaking"       -0.29
" attack"         -0.29
"-breaking"       -0.28
"çł´"             -0.28
"_attack"         -0.28
"çªģ"             -0.28
" Attack"         -0.27
"quential"        -0.27

Positive Logits

Token (quoted)    Logit
"quo"             0.28
"å½±"             0.26
"æķĽ"             0.26
" sext"           0.26
" Carson"         0.25
"Imp"             0.25
"imos"            0.25
"èį¨"             0.25
" pass"           0.25
"èł²"             0.24
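
Several token strings in the logit lists (for example åıĹå®³ or çł´) look like raw byte-level BPE tokens rather than readable text. The sketch below assumes the underlying tokenizer uses the GPT-2 style byte-to-unicode table, which this page does not confirm; the helper names are hypothetical. Under that assumption, it maps each character back to its byte and decodes the result as UTF-8.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every other byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a displayed token string back into text."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Characters outside the table (e.g. a literal space) pass through.
            raw.extend(ch.encode("utf-8"))
    # Partial multi-byte tokens decode to the replacement character.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["åıĹå®³", "çł´", " attack"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

If the assumption holds, åıĹå®³ decodes to 受害 and çł´ to 破. Tokens that cover only part of a multi-byte character will show a replacement character instead.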

Activations Density 0.884%
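
As a rough sense of scale, the density can be turned into a token count. This sketch assumes that the density is the fraction of tokens on which the feature is active and that it was measured over the dashboard prompts listed above (16,384 prompts of 128 tokens each); neither assumption is stated on this page, and the variable names are illustrative.

```python
# Figures taken from the dashboard configuration on this page.
num_prompts = 16_384
tokens_per_prompt = 128
density_percent = 0.884

# Total tokens in the dashboard prompt set.
total_tokens = num_prompts * tokens_per_prompt

# Assumption: density = active tokens / total tokens.
approx_active_tokens = round(total_tokens * density_percent / 100)

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {approx_active_tokens:,}")
```

Under these assumptions the feature fires on roughly 18,500 of about 2.1 million tokens.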