Explanations

harmful content refusal

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

"姮"  0.78
" achievements"  0.77
" dreamy"  0.76
" involution"  0.75
"cią"  0.73
" grinning"  0.73
" magician"  0.72
" Adventures"  0.72
"Styled"  0.72
"കൊണ്ട്"  0.72

Positive Logits

"द्वितीय"  0.69
"详细"  0.67
"वार"  0.67
"בה"  0.64
"↵↵↵↵↵↵"  0.62
"یکی"  0.60
" Memorial"  0.59
"ini"  0.58
"’’"  0.58
"'''"  0.58

Activations Density: 0.093%