Explanations

refusal due to harmful content

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

Token           Logit
" eyeballs"     0.60
"oter"          0.60
" showError"    0.60
"ㄙ"            0.57
"Twenty"        0.56
"新型コロナ"    0.55
"OG"            0.54
"APPRO"         0.53
"ɸ"             0.53
"否定"          0.53

Positive Logits

Token           Logit
" Come"         0.79
" come"         0.78
"Come"          0.67
"थिक"           0.61
"kd"            0.58
" Clash"        0.58
"icii"          0.57
" Crown"        0.57
" COME"         0.57
"тных"          0.56

Activations Density 0.236%
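
For readers who want a rough sense of scale, the sketch below treats activation density as the fraction of token positions on which the feature fires. That definition is an assumption, as is the idea that the density is measured over the same 238,145 prompts of 512 tokens listed under Configuration. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    `activations` is a list of prompts, each a list of per-token activation
    values. This is an assumed definition, not the dashboard's confirmed one.
    """
    total = 0
    active = 0
    for prompt in activations:
        for value in prompt:
            total += 1
            if value > threshold:
                active += 1
    return active / total if total else 0.0


# Toy example: the feature fires on 1 of 8 token positions.
toy = [[0.0, 0.0, 1.2, 0.0], [0.0, 0.0, 0.0, 0.0]]
density = activation_density(toy)

# Back-of-the-envelope scale, assuming the density applies to the
# dashboard prompt set (238,145 prompts of 512 tokens each).
total_tokens = 238_145 * 512
approx_active_tokens = round(total_tokens * 0.236 / 100)
```

Under those assumptions, a density of 0.236% over roughly 122 million tokens would mean the feature is active on a few hundred thousand token positions.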