INDEX

Explanations

mimics malicious activity

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

"之间的": 0.49
"聃": 0.47
"HttpSession": 0.45
" clockRadius": 0.44
"嫔": 0.44
" searchQuery": 0.44
" Gmail": 0.43
" LinkedIn": 0.42
"上网": 0.42
" Hypertension": 0.42

Positive Logits

" eine": 0.51
"flag": 0.48
"not": 0.47
"produce": 0.46
" promotes": 0.45
" signalled": 0.45
"_=": 0.45
"own": 0.45
" finds": 0.44
"wa": 0.44

Activations Density 0.003%
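
As a rough sanity check on the density figure, the sketch below turns it into an approximate token count using the prompt count and prompt length listed under Prompts (Dashboard). It assumes, which the page does not state, that density is the fraction of all dashboard tokens on which the feature activates, and that every prompt contributes exactly 512 tokens. The variable names are illustrative only.

```python
# Values copied from the dashboard text.
num_prompts = 238_145
tokens_per_prompt = 512
density_percent = 0.003  # "Activations Density 0.003%"

# Assumption: density = activating tokens / total tokens in the dashboard set.
total_tokens = num_prompts * tokens_per_prompt
approx_active_tokens = total_tokens * density_percent / 100

print(f"total tokens:        {total_tokens:,}")
print(f"approx. activations: {round(approx_active_tokens):,}")
```

Under those assumptions the feature would fire on only a few thousand of roughly 122 million tokens. Because the displayed density is rounded to one significant figure, the result is an order-of-magnitude estimate and not an exact count.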