Explanations

harmful slurs and hate speech

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" گھ"  0.40
"Pul"  0.39
"Gioco"  0.39
" Medicines"  0.39
"徬"  0.39
"Cut"  0.38
"晕"  0.38
" Liver"  0.37
" pinched"  0.37
" stung"  0.37

Positive Logits

"原子力"  0.40
"chmal"  0.38
"etera"  0.38
" सहकारी"  0.38
" புதி"  0.36
"untansi"  0.36
"(&"  0.35
"ประมาณ"  0.35
"ा"  0.35
" निर्मित"  0.35

Activations Density 0.001%
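
To put the density figure in context, here is a rough arithmetic sketch. It assumes that "Activations Density" means the fraction of dashboard tokens on which the feature has a nonzero activation, and that the token total is the prompt count times the tokens per prompt from the Prompts (Dashboard) entry. The page does not state either assumption, and the function names are hypothetical. Because the displayed 0.001% is rounded, the result is an order-of-magnitude estimate only.

```python
def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total tokens in the dashboard run (assumes every prompt is full length)."""
    return num_prompts * tokens_per_prompt


def expected_active_tokens(density_percent: float, n_tokens: int) -> float:
    """Approximate count of tokens with a nonzero activation.

    Assumes density is reported as a percentage of all dashboard tokens.
    """
    return density_percent / 100.0 * n_tokens


# Values copied from the dashboard: 238,145 prompts of 512 tokens, density 0.001%.
n = total_tokens(238_145, 512)
active = expected_active_tokens(0.001, n)
print(f"{n:,} tokens in total, roughly {round(active):,} with a nonzero activation")
```

Under these assumptions the feature fires on about one token in 100,000, so only a small number of examples back the explanation.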