Explanations

Reinforcement Learning from Human Feedback

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted; a leading space is part of the token)  Logit
"Gew"  0.42
" Sauer"  0.38
"MDC"  0.38
"㽛"  0.38
"Eg"  0.37
" SDLK"  0.36
" করপোর"  0.35
"MORDOR"  0.35
"civ"  0.35
" Morley"  0.34

Positive Logits

Token (quoted; a leading space is part of the token)  Logit
" kick"  0.43
"pf"  0.42
" kicked"  0.39
" Kick"  0.39
"kick"  0.38
" humiliation"  0.38
"PF"  0.37
"pf"  0.37
" kicks"  0.36
"跎"  0.35

Activations Density 0.008%