INDEX

Explanations

undetectable subtle changes

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" demanded"     0.41
" motivated"    0.38
" яр"           0.36
" vouloir"      0.36
" definitiv"    0.35
"鞦"            0.35
"ますが"        0.35
"emph"          0.35
"겁"            0.34
" ആവശ"         0.34

Positive Logits

" unnoticed"    2.36
" invisible"    2.28
" hidden"       2.27
" secretly"     2.16
"invisible"     2.09
" undetected"   2.06
" invis"        2.03
" unseen"       2.03
"hidden"        2.00
"Invisible"     2.00

Activations Density 0.057%
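
To give a sense of scale for the density figure, the sketch below multiplies it by the size of the dashboard prompt set listed under Configuration. It assumes that the density is the fraction of tokens on which the feature has a nonzero activation, and that it was measured over that same prompt set; the page states neither, so the result is only a rough estimate. The function and variable names are illustrative, not part of any dashboard API.

```python
# Figures copied from the dashboard above.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.057  # "Activations Density 0.057%"


def estimated_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Estimate how many tokens activate the feature.

    Assumption: density is the share of all tokens with a nonzero
    activation, measured over this same prompt set.
    """
    total_tokens = num_prompts * tokens_per_prompt
    return total_tokens * density_percent / 100.0


total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT
active = estimated_active_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT)

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active:,.0f}")
```

Under these assumptions the feature fires on roughly 69,500 of about 121.9 million tokens, which is in line with a sparse feature tied to a narrow concept such as "hidden" or "unnoticed".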