Explanations

Requests or scenarios involving harmful or inappropriate behavior related to power dynamics and sexual exploitation.

I strongly / I must

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

他們  0.36
 USING  0.35
他们  0.35
插入  0.35
츰  0.34
يند  0.34
位  0.33
 hooks  0.33
%%  0.32
 extracted  0.32

Positive Logits

 wholeheartedly  0.67
 gladly  0.64
 encourage  0.55
 gratefully  0.55
 believe  0.54
 encour  0.52
 sincerely  0.52
 heartily  0.49
鼓励  0.49
 regret  0.48

Activations Density 0.642%