INDEX

Explanations

this paper investigates/examines/explores

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

ه  0.86
י  0.65
Según  0.60
 zwią  0.59
Tesla  0.58
Aunque  0.58
Após  0.58
ق  0.57
بعد  0.56
ب  0.56

Positive Logits

asse  0.73
কারি  0.63
estic  0.61
ads  0.61
asters  0.60
становка  0.59
ський  0.59
✶  0.58
ន៍  0.57
徑  0.57

Activations Density: 0.005%