Explanations

harmful content and instructions

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" provides"   0.32
" aids"       0.32
"專業"        0.29
" aided"      0.28
" Deluxe"     0.28
" hopefully"  0.27
" Made"       0.27
"&#"          0.27
" Provides"   0.27
" Gaff"       0.26

Positive Logits

" полити"      0.31
"defeated"     0.31
"proble"       0.30
" gestire"     0.30
"ുമോ"          0.30
"统治"         0.30
" sbagli"      0.30
" politique"   0.30
" accusation"  0.30
" kuhusu"      0.30

Activations Density 0.001%
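
To put the density figure in perspective, the arithmetic below is a rough sketch of how many token positions it might correspond to. It assumes, without confirmation from the dashboard, that the density was measured over the same 238,145 prompts of 512 tokens each listed under Configuration, and that 0.001% is a rounded value. The variable names are illustrative only.

```python
# Figures taken from the dashboard above.
n_prompts = 238_145
tokens_per_prompt = 512

# 0.001% expressed as a fraction. The dashboard value is rounded,
# so the result below is an order-of-magnitude estimate only.
density = 0.001 / 100

# ASSUMPTION: density is measured over the dashboard prompt set.
total_tokens = n_prompts * tokens_per_prompt
expected_active = total_tokens * density

print(f"total token positions: {total_tokens:,}")
print(f"approx. positions where the feature fires: {round(expected_active):,}")
```

Under that assumption the feature would fire on roughly a thousand token positions out of about 122 million, which is consistent with a very sparse feature.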