Explanations

protecting the vulnerable

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces), then value:

"প"  0.60
" alebo"  0.57
" உட்பட"  0.56
"zné"  0.55
" pouze"  0.54
"”"  0.53
" potre"  0.52

Positive Logits

Token (quoted to show leading spaces), then value:

" protecting"  0.78
" protect"  0.76
" protected"  0.73
" melindungi"  0.71
"д"  0.68
" protects"  0.68
"ל"  0.65
" 보호"  0.64
" захи"  0.63
" vulnerable"  0.63

Activations Density 0.070%
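
To give the density figure a sense of scale, here is a rough sketch that combines it with the prompt count and token length listed under Configuration. It assumes the density is the fraction of token positions with a nonzero activation and that it was measured over that same prompt set. The page does not confirm either point, so treat the result as an order-of-magnitude estimate. The function name is illustrative only.

```python
# Figures quoted on the dashboard page.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.070  # "Activations Density"


def estimate_active_tokens(num_prompts: int, tokens_per_prompt: int,
                           density_percent: float) -> tuple[int, int]:
    """Return (total token positions, estimated activating positions).

    Assumption (not stated on the page): density is the share of token
    positions where the feature fires, measured over these prompts.
    """
    total_tokens = num_prompts * tokens_per_prompt
    active_tokens = round(total_tokens * density_percent / 100)
    return total_tokens, active_tokens


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"total token positions: {total_tokens:,}")
print(f"estimated activating positions: {active_tokens:,}")
```

Under those assumptions the feature would fire on roughly one token position in every 1,400, which fits a narrow concept such as "protecting the vulnerable".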