Explanations

deviating from normal

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token             Logit
"ος"              -0.77
"✜"               -0.74
"istible"         -0.73
" 最高"           -0.73
"PHONY"           -0.72
" Chara"          -0.70
" rivalry"        -0.70
"rscheinlich"     -0.70
"]]"              -0.69
" sibling"        -0.69

Positive Logits

Token             Logit
" deviation"      5.56
" deviations"     5.16
" deviate"        5.09
" devi"           4.94
" Deviation"      4.63
"deviation"       4.50
"Deviation"       4.28
" departures"     4.09
" departure"      3.95
"devi"            3.73

Activations Density 0.150%
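
To give a sense of scale, the density figure can be combined with the prompt configuration listed above. The sketch below assumes that "Activations Density" means the fraction of dashboard tokens on which the feature has a non-zero activation; the page does not state this definition, so treat the result as a rough estimate. The constant and function names are hypothetical and chosen only for illustration.

```python
# Figures copied from the dashboard: 24,576 prompts, 128 tokens each,
# and an activations density of 0.150%.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.150


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens in the dashboard prompt set."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> int:
    """Approximate count of tokens on which the feature fires.

    ASSUMPTION: density is the share of tokens with non-zero activation.
    """
    return round(total * density_percent / 100)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, DENSITY_PERCENT)
print(f"total tokens: {total:,}")
print(f"estimated active tokens: {active:,}")
```

Under that assumption the feature fires on only a few thousand of roughly three million tokens, which fits a narrow feature concentrated on "deviation"-style tokens.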