Explanations

cause incorrect behavior

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

Token           Logit
" stressors"    -0.90
" оказалось"    -0.85
" prelimin"     -0.85
"ulescu"        -0.84
"ierungs"       -0.84
" Baptiste"     -0.82
" Hür"          -0.81
" crudo"        -0.81
" budgetary"    -0.80
"ättern"        -0.79

Positive Logits

Token              Logit
" incorrect"       1.36
" unpredictable"   1.30
" unexpected"      1.30
" behavior"        1.26
" surprising"      1.13
" behaviour"       1.13
" surprises"       1.10
"unexpected"       1.09
" unintended"      1.03
" subtle"          1.02

Activations Density 0.192%
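
As a rough reading of the density figure, the sketch below converts it into an approximate token count. It assumes the density is measured over the dashboard prompt set listed under Configuration (24,576 prompts of 128 tokens each); the page does not state this, so treat the result as an estimate only. The variable names are illustrative.

```python
# Figures taken from the Configuration section and the density line.
num_prompts = 24_576
tokens_per_prompt = 128
activation_density_pct = 0.192  # percent of tokens on which the feature fires

# Total tokens in the dashboard prompt set.
total_tokens = num_prompts * tokens_per_prompt

# ASSUMPTION: the density is computed over exactly these tokens.
estimated_active_tokens = total_tokens * activation_density_pct / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {estimated_active_tokens:,.0f}")
```

Under that assumption, the feature fires on roughly 6,000 of about 3.1 million tokens, which fits a sparse feature tied to a narrow concept such as incorrect or unexpected behavior.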