Explanations

can benefit

policy-safety refusal language, especially content addressing sexual misconduct, coercion, or harm and outlining ethical/legal boundaries or help resources.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token              Logit
"Crew"             0.47
" ടീ"              0.46
"createServer"     0.45
"querySelectorAll" 0.44
" Woolf"           0.44
"cknowledg"        0.42
" ToolsVersion"    0.42
"ValueSet"         0.42
"ങ"                0.42
"吴"               0.41

Positive Logits

Token              Logit
"spr"              0.45
"Sub"              0.43
"OEM"              0.42
" definition"      0.42
" margin"          0.41
"Tou"              0.41
" harbour"         0.40
"asci"             0.40
" vector"          0.39
" estimate"        0.39

Activations Density 14.992%