INDEX

Explanations

harmful or explicit scenarios

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

ท์    1.02
てる    0.96
𓈒    0.94
るのは    0.91
นุ    0.89
るので    0.88
 Koff    0.87
ógica    0.87
hwar    0.87
dt    0.86

Positive Logits

其他    1.23
не    1.19
ী    1.09
ز    1.09
ারে    1.02
 caucasian    0.99
ется    0.98
ية    0.97
ﺌ    0.95
ک    0.94

Activations Density 0.108%
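
To give the density figure a sense of scale, the sketch below combines it with the prompt count and prompt length listed under Configuration. It rests on two assumptions the page does not confirm: that the density is the share of tokens with a nonzero activation, and that it was measured over this same prompt set. The function names are illustrative only.

```python
# Figures copied from this page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.108


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total token positions across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> int:
    """Approximate count of tokens with a nonzero activation.

    Assumption: density is the percentage of tokens that activate,
    measured over the same prompts listed under Configuration.
    """
    return round(total * density_percent / 100)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, DENSITY_PERCENT)

print(f"{total:,} tokens in total")
print(f"roughly {active:,} tokens with a nonzero activation")
```

Under those assumptions the result is only an order-of-magnitude estimate, since the displayed density is itself rounded to three decimal places.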