Explanations

unethical, unsafe conditions

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

 thereon    0.63
 asimismo    0.62
 thus    0.60
 тако    0.59
Trajectory    0.59
 thereby    0.57
 geodesic    0.57
杈    0.57
 FURTHER    0.57
hede    0.56

Positive Logits

 unethical    0.84
不能    0.82
 stressful    0.80
 માત્ર    0.78
 केवळ    0.78
 shouldn    0.77
 unsafe    0.76
不上    0.76
 不能    0.75
 unhappy    0.75

Activations Density 0.338%
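
To give the density figure a sense of scale, the following minimal sketch converts it into an approximate token count. It rests on two assumptions that the dashboard does not state: that the density is measured over the same prompt set listed under Configuration, and that every prompt contributes exactly 512 tokens. The constant and function names are illustrative only.

```python
# Figures copied from the dashboard above.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY_PERCENT = 0.338


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total tokens, assuming every prompt has exactly the stated length."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Approximate number of tokens on which the feature is non-zero.

    Assumption: the density is a fraction of all tokens in the prompt set.
    """
    return total * density_percent / 100.0


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY_PERCENT)

print(f"total tokens: {total:,}")
print(f"estimated active tokens: {active:,.0f}")
```

If the density was instead computed on a different sample, only the total changes; the conversion itself stays the same.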