Explanations

inherently problematic, abstract, deceptive, dangerous

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"м"     1.41
"ين"    1.34
"ان"    1.26
"ー"    1.21
"ات"    1.20
"자"    1.20
"та"    1.18
"ع"     1.18
"人"    1.16
"み"    1.14

Positive Logits

"MAN"                      0.97
[token not recoverable]    0.95
[token not recoverable]    0.87
" inherently"              0.86
" eller"                   0.84
[token not recoverable]    0.79
"ama"                      0.78
"DO"                       0.76
"st"                       0.75
"je"                       0.75

Activations Density 0.007%
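
To give the density figure a sense of scale, the sketch below converts it into an approximate token count. It assumes that the density is the share of tokens with a nonzero activation and that it was measured over the dashboard prompt set listed under Configuration; the page confirms neither. The function name and constants are illustrative only.

```python
# Figures copied from the Configuration section of this page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512

# Density as displayed on the page (a rounded percentage).
ACTIVATION_DENSITY_PERCENT = 0.007


def estimate_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total_tokens, estimated_active_tokens).

    Assumption: density is the percentage of tokens on which the
    feature has a nonzero activation, measured over this prompt set.
    """
    total = num_prompts * tokens_per_prompt
    return total, total * density_percent / 100.0


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY_PERCENT
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active_tokens:,.0f}")
```

Under these assumptions the feature would fire on roughly 8,500 of about 122 million tokens. Because the displayed density is rounded, the estimate is only an order of magnitude.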