Explanations

content that promotes harmful stereotypes and hate speech against marginalized groups.

I understand

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

this  0.19
 ardından  0.19
↵  0.18
 この  0.16
 کیا۔  0.16
 tämä  0.16
 هذا  0.16
 tomto  0.16
。  0.16
 تاسو  0.16

Positive Logits

 může  0.21
 pueden  0.21
 môže  0.20
 peut  0.19
 puede  0.19
 можуть  0.19
 certainly  0.18
 mohou  0.18
 mogą  0.18
 może  0.17

Activations Density 0.438%
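
As a rough sanity check on the density figure: assuming (the page does not state this) that activation density is the percentage of dashboard tokens on which the feature has a nonzero activation, the prompt count and prompt length listed under Configuration imply an approximate number of active tokens. The sketch below only does that arithmetic; the variable names are illustrative and not part of any dashboard API.

```python
# Figures copied from the dashboard listing above.
num_prompts = 238_145
tokens_per_prompt = 512
density_pct = 0.438  # "Activations Density", expressed in percent

# ASSUMPTION: density = percentage of all dashboard tokens with a nonzero
# activation for this feature. The page itself does not define the term.
total_tokens = num_prompts * tokens_per_prompt
approx_active_tokens = round(total_tokens * density_pct / 100)

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {approx_active_tokens:,}")
```

If the density is instead computed over a different sample than the dashboard prompts, the second figure would not apply; only the total token count follows directly from the configuration.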