Explanations

unnecessary, risky, unreliable, problematic

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"bzw"       0.41
"を確認"     0.40
" numeri"   0.39
"ⴱ"         0.39
" allons"   0.38
"Мар"       0.38
"urb"       0.38
"瑢"         0.38
"adapter"   0.38
" pivotal"  0.38

Positive Logits

" unnecessary"    0.83
" unnecessarily"  0.69
" अनावश्यक"        0.68
" risky"          0.64
" unreliable"     0.63
" undermines"     0.62
" interferes"     0.56
" risks"          0.55
" problematic"    0.55
" needlessly"     0.55
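
The page does not say how these logit lists are produced. A common approach for dashboards of this kind is to project the feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting value. The sketch below illustrates that idea only. The names logit_effects and top_tokens are hypothetical, and the tiny vocabulary and matrix are made-up toy values, not data from this feature.

```python
# Toy sketch (assumption): logit effect = decoder direction projected through the unembedding.
# All numbers below are invented for illustration; they are not taken from the dashboard.

def logit_effects(decoder_dir, unembed):
    """Dot the decoder direction with each token's unembedding column.

    unembed is a list of rows, one per residual dimension;
    each row holds one entry per vocabulary token.
    """
    n_vocab = len(unembed[0])
    return [
        sum(d * row[t] for d, row in zip(decoder_dir, unembed))
        for t in range(n_vocab)
    ]

def top_tokens(effects, vocab, k):
    """Return the k most positive and k most negative tokens with their effects."""
    order = sorted(range(len(effects)), key=lambda i: effects[i])  # ascending
    negative = [(vocab[i], effects[i]) for i in order[:k]]
    positive = [(vocab[i], effects[i]) for i in reversed(order[-k:])]
    return positive, negative

# Hypothetical 2-dimensional residual stream and 4-token vocabulary.
vocab = [" unnecessary", " risky", "bzw", "adapter"]
unembed = [
    [1.0, 0.5, -0.5, -1.0],
    [0.0, 1.0, 1.0, 0.0],
]
decoder_dir = [1.0, 0.0]

effects = logit_effects(decoder_dir, unembed)
positive, negative = top_tokens(effects, vocab, k=2)
print(positive)  # [(' unnecessary', 1.0), (' risky', 0.5)]
print(negative)  # [('adapter', -1.0), ('bzw', -0.5)]
```

Tokens are kept as exact strings, including any leading space, because a leading space is part of the token in most tokenizers.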

Activations Density 0.200%
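
To give a sense of scale, the density can be combined with the prompt configuration listed above. This is a rough estimate only: it assumes the density is the fraction of token positions with a nonzero activation, and that it was measured over the full 238,145 prompts of 512 tokens each. The page confirms neither assumption.

```python
# Rough estimate (assumptions: density = fraction of token positions where the
# feature is active, measured over every token of the dashboard prompt set).
prompts = 238_145
tokens_per_prompt = 512
density = 0.200 / 100  # 0.200% expressed as a fraction

total_tokens = prompts * tokens_per_prompt
estimated_active_tokens = total_tokens * density

print(total_tokens)                    # 121930240
print(round(estimated_active_tokens))  # 243860
```

Under those assumptions the feature would fire on roughly 244 thousand of about 122 million token positions.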