Explanations

No Explanations Found

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token            Value
" marav"         0.48
"强大"           0.43
" supremo"       0.43
" Faced"         0.43
" slaughtered"   0.42
"豪華"           0.42
" понадоби"      0.41
" spared"        0.40
":)"             0.40
" ಚೆ"            0.40

Positive Logits

Token            Value
" harmful"       1.73
" unacceptable"  1.55
" problematic"   1.52
" distressing"   1.51
" disturbing"    1.50
" troubling"     1.50
" damaging"      1.44
" detrimental"   1.44
" unhealthy"     1.34
" unsettling"    1.33

Activations Density 0.965%
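
To give the density figure a sense of scale, the short sketch below works out how many token positions the dashboard prompts contain and roughly how many of them 0.965% would correspond to. It rests on two assumptions the page does not confirm: that the density is measured over the dashboard prompt set listed under Configuration, and that it means the fraction of token positions on which the feature has a nonzero activation. The function names are illustrative only.

```python
# Figures copied from the dashboard text.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.965 / 100  # 0.965% expressed as a fraction


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total token positions across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density: float) -> int:
    """Rough count of positions where the feature fires.

    Assumption: density is the fraction of token positions with a
    nonzero activation, measured over the same prompt set.
    """
    return round(total * density)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY)

print(f"total token positions: {total:,}")
print(f"estimated active positions: {active:,}")
```

Under those assumptions the feature would fire on a little over a million of roughly 122 million token positions. If the density was computed on a different sample, only the percentage itself carries over.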

No Known Activations