Explanations

harmful or exploitative content

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

(token not shown)    0.79
<h1>    0.75
Table    0.72
Draw    0.70
—    0.70
（    0.69
See    0.67
---    0.67
View    0.65
(token not shown)    0.64

Positive Logits

LEC    0.88
 Oversight    0.87
 incapacity    0.86
<unused1888>    0.86
<unused368>    0.85
 مذہبی    0.84
 russe    0.83
<unused1044>    0.83
<unused2145>    0.83
 hating    0.83

Activations Density: 0.350%
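
To give the density figure a sense of scale, the short sketch below converts it into an approximate token count. It rests on two assumptions that the dashboard does not state explicitly: that "Activations Density" means the fraction of token positions on which the feature has a nonzero activation, and that every prompt contributes exactly 512 tokens. The function name `estimated_active_tokens` is hypothetical and used only for illustration.

```python
# Figures copied from the dashboard configuration.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.350 / 100  # 0.350% expressed as a fraction


def estimated_active_tokens(num_prompts, tokens_per_prompt, density):
    """Return (total token positions, estimated positions where the feature fires).

    Assumption: density is the fraction of token positions with a
    nonzero activation, and every prompt has the same length.
    """
    total_tokens = num_prompts * tokens_per_prompt
    return total_tokens, round(total_tokens * density)


total, active = estimated_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY
)
print(total)   # 121930240
print(active)  # 426756
```

Under these assumptions the feature would fire on roughly 427 thousand of about 122 million token positions, which is typical of a sparse feature.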