INDEX

Explanations

refusal of harmful requests

Configuration

Prompts (Dashboard)

392,802 prompts, 256 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Logit  Token
0.84   udir
0.76   pand
0.71   ivate
0.70   álni
0.69   antam
0.67    stoi
0.65
0.65   jom

Positive Logits

Logit  Token
2.16
1.55   €™
1.54   Tis
1.48   sG
1.43   ’’’’
1.37   ਤੇ
1.34   র
1.33   ا
1.32   sse
1.31   ㅅ

Activations Density 0.459%