Explanations

possessive or controlling actions

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

 defiant  0.39
 defiance  0.37
 داخل  0.36
 subversive  0.36
 പോലുള്ള  0.35
 کمتر  0.34
暐  0.34
 தை  0.33
dagen  0.33
 внутри  0.33

Positive Logits

 threatening  0.36
 threatened  0.34
 encro  0.34
 merciless  0.33
reactant  0.33
 threatens  0.33
 accusing  0.33
 bully  0.32
 wanting  0.30
 reciproc  0.30

Activations Density 0.159%
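
To give the density figure some scale, the arithmetic below is a rough sketch rather than anything the dashboard itself reports. It assumes that the activation density is the fraction of token positions where the feature has a nonzero activation, and that it was measured over the same 238,145 prompts of 512 tokens listed under Configuration. Neither assumption is stated in the page, and the function names are hypothetical.

```python
# Values copied from the dashboard text.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.159


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total token positions, assuming every prompt has the full length."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> int:
    """Approximate count of positions with nonzero activation.

    Assumption: density is (active positions / total positions) * 100.
    """
    return round(total * density_percent / 100)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, DENSITY_PERCENT)

print(f"{total:,} token positions in total")
print(f"roughly {active:,} positions with nonzero activation")
```

Under these assumptions the feature fires on only a small fraction of the corpus, which is consistent with a sparse feature that has a narrow explanation such as "possessive or controlling actions".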