Explanations

The challenge/most/trick/danger/critical

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" defam"   0.44
"gham"     0.39
"日が"     0.39
" ಶ್ರೀ"     0.38
"вро"      0.38
" ibid"    0.37
" pice"    0.37
"䤃"       0.37
" jahr"    0.36
"HOBBIT"   0.36

Positive Logits

" importantly"   0.57
" advantage"     0.54
" benefit"       0.51
" drawback"      0.51
" important"     0.50
" irony"         0.47
"遗憾"           0.47
" trick"         0.47
" caveat"        0.46
" Vorteil"       0.46

Activations Density 0.046%
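
As a rough sanity check on what this density implies, the sketch below estimates how many tokens the feature fires on. It assumes (the page does not state this) that the 0.046% density is measured over the full prompt set listed under Configuration, 238,145 prompts of 512 tokens each. The variable names are illustrative only.

```python
# Assumption: density = (tokens with nonzero activation) / (all tokens in the prompt set).
num_prompts = 238_145        # from "Prompts (Dashboard)"
tokens_per_prompt = 512      # from "Prompts (Dashboard)"
density_percent = 0.046      # from "Activations Density"

# Total tokens in the prompt set.
total_tokens = num_prompts * tokens_per_prompt

# Approximate count of tokens on which the feature is active.
approx_active_tokens = round(total_tokens * density_percent / 100)

print(f"total tokens:         {total_tokens:,}")
print(f"approx active tokens: {approx_active_tokens:,}")
```

Under that assumption the feature is active on roughly one token in every 2,200, which fits a narrow feature tied to a small family of words such as those in the explanation and the positive logits.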