Explanations

unauthorized, unknown, never

words indicating refusal, warnings, or ethical objections to harmful requests.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

 déprim  0.44
லா  0.43
VisualStyle  0.41
 кое  0.41
 depressing  0.41
 قابل  0.40
难以  0.40
 करताना  0.40
可用  0.40
អាច  0.40

Positive Logits

知らない  0.60
 अनजान  0.59
 unauthorized  0.56
 unfounded  0.56
 inexist  0.56
 unknowingly  0.56
 never  0.55
不存在  0.55
 nunca  0.55
 desconoc  0.55

Activations Density: 0.154%