Explanations

innocent, innocence, or harmlessness

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token            Logit
" verpflicht"    0.40
"枌"             0.38
"फलता"           0.37
" Flexible"      0.37
"fab"            0.36
" etik"          0.36
"砣"             0.35
"hesive"         0.35
" облі"          0.35
"黯"             0.35

Positive Logits

Token            Logit
" innocent"      3.98
" innocence"     3.72
" innoc"         3.58
" Innocent"      3.38
" Innoc"         3.33
" inoc"          2.94
" innocuous"     2.27
"inn"            2.23
" harmless"      2.11
" inoculation"   2.00

Activations Density 0.040%
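
As a rough sense of scale, the density figure can be turned into an approximate count of activating tokens. The sketch below rests on assumptions the page does not state: that the density is the share of tokens with a non-zero activation, that it was measured over the dashboard prompt set listed under Configuration, and that every prompt is exactly 512 tokens long. The function name `estimate_active_tokens` is made up for illustration.

```python
# Figures copied from the dashboard page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY_PERCENT = 0.040


def estimate_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total tokens, estimated number of tokens that activate the feature).

    Assumption: density_percent is the percentage of all tokens on which the
    feature has a non-zero activation.
    """
    total_tokens = num_prompts * tokens_per_prompt
    active_tokens = total_tokens * density_percent / 100.0
    return total_tokens, active_tokens


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY_PERCENT
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated activating tokens: {active_tokens:,.0f}")
```

If any of those assumptions is off (for example, if the density was measured on a different sample), the estimate changes in proportion.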