Explanations

pose a threat/risk/violation

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

"黴"  0.62
"ዝ"  0.58
"高峰"  0.57
" 주의"  0.56
" cautious"  0.56
"ow"  0.56
"افظ"  0.56
"ah"  0.55
"ాలంటే"  0.55
" Junction"  0.54

Positive Logits

" teeth"  0.80
" existential"  0.69
" tooth"  0.69
" Trojan"  0.68
"ɬ"  0.67
" Transform"  0.67
" Teeth"  0.66
" attack"  0.66
" dientes"  0.66
" camada"  0.65

Activations Density 0.147%
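
To give the density figure a sense of scale, the sketch below combines it with the prompt counts listed under Configuration. It assumes (this is not stated on the page) that activation density means the fraction of all dashboard tokens on which the feature fires. The function name `estimate_active_tokens` is hypothetical, and the result is only an estimate because the density is rounded to three decimal places.

```python
# Figures copied from the dashboard page above.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY_PERCENT = 0.147


def estimate_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total_tokens, estimated_active_tokens).

    Assumption: density_percent is the share of all dashboard tokens
    on which the feature has a nonzero activation.
    """
    total_tokens = num_prompts * tokens_per_prompt
    active_tokens = round(total_tokens * density_percent / 100)
    return total_tokens, active_tokens


total, active = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY_PERCENT
)
print(f"total tokens: {total:,}")
print(f"estimated active tokens: {active:,}")
```

Under that assumption the feature fires on roughly 179,000 of about 122 million tokens, which is consistent with a sparse feature.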