Explanations

declining harmful requests model

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"pseudo"  0.72
"autos"  0.69
"line"  0.68
" শহীদ"  0.66
"伪"  0.63
"inters"  0.61
" pseudo"  0.60
"SequentialGroup"  0.59
"ster"  0.59
"Pseudo"  0.59

Positive Logits

" parable"  0.69
"猜"  0.66
" UIText"  0.64
" 보면은"  0.62
"樀"  0.62
" technological"  0.61
"忶"  0.61
"㷅"  0.61
" platforms"  0.61
" patty"  0.60

Activations Density 0.070%
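
As a rough sanity check on the density figure, the sketch below estimates how many tokens it would correspond to. It assumes, which the page does not state, that density is the fraction of dashboard tokens with a nonzero activation and that it was measured over the 238,145 prompts of 512 tokens each listed under Configuration. The function and variable names are illustrative only.

```python
# Assumption (not confirmed by the dashboard): density is the share of
# dashboard tokens on which the feature has a nonzero activation.
N_PROMPTS = 238_145          # from "Prompts (Dashboard)"
TOKENS_PER_PROMPT = 512      # from "Prompts (Dashboard)"
DENSITY_PERCENT = 0.070      # from "Activations Density"


def estimated_active_tokens(n_prompts, tokens_per_prompt, density_percent):
    """Return (total tokens, estimated number of tokens with activation)."""
    total = n_prompts * tokens_per_prompt
    active = round(total * density_percent / 100)
    return total, active


total_tokens, active_tokens = estimated_active_tokens(
    N_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active_tokens:,}")
```

Under these assumptions the feature would be active on roughly one token in every 1,400, so most prompts in the dashboard set would show no activation at all.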