INDEX

Explanations

potential to break or impact

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" discouraging"    0.46
"頹"               0.45
"黯"               0.42
" disappointment"  0.39
" жал"             0.38
" dissipation"     0.38
"限界"             0.37
" translational"   0.37
" stimulant"       0.37
"乏"               0.37

Positive Logits

" wreak"     0.96
" broke"     0.95
" messes"    0.93
" messed"    0.92
" breaking"  0.90
" break"     0.88
"wre"        0.88
" screwed"   0.85
" messing"   0.84
" breaks"    0.83

Activations Density: 0.025%
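
To give a rough sense of scale for the density figure, the sketch below converts it into an approximate count of activating tokens. It assumes, and the page does not confirm, that the density is the fraction of tokens on which the feature fires, measured over the same prompt set listed under Configuration. The variable names are illustrative only.

```python
# Figures taken from the dashboard above.
num_prompts = 238_145
tokens_per_prompt = 512
density_percent = 0.025  # "Activations Density", as a percentage

# Assumption: density is measured over every token of the dashboard prompts.
total_tokens = num_prompts * tokens_per_prompt

# Approximate number of tokens on which the feature would be active.
active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"approx. activating tokens: {active_tokens:,.0f}")
```

Under that assumption the feature would fire on roughly 30 thousand of about 122 million tokens, which is consistent with a sparse, narrowly scoped feature.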