INDEX

Explanations

experiencing harmful thoughts or urges

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to preserve leading spaces), value
" ಬರುತ್ತ"  0.41
"の効果"  0.41
" neutralization"  0.39
"ктери"  0.37
" afterwards"  0.37
" afterward"  0.37
" sylvis"  0.36
"AsyncKeyState"  0.35
"时"  0.34
" zweiten"  0.34

Positive Logits

Token (quoted to preserve leading spaces), value
" wondering"  0.50
" offended"  0.40
"artet"  0.40
"an"  0.40
" penggem"  0.39
" শিক্ষার্থী"  0.39
"…?"  0.39
"面對"  0.39
" ஒரு"  0.38
" disappointed"  0.38

Activations Density: 0.151%