INDEX

Explanations

available to, evaluated for, calculated in, reveal vulnerabilities

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"лянчук"  0.55
"жай"  0.49
"Ꮘ"  0.49
"脛"  0.47
" kwaliteit"  0.44
"堰"  0.44
"తున్నాయి"  0.43
"Კ"  0.43
"ཋ"  0.43
"ין"  0.43

Positive Logits

" when"  0.69
"khi"  0.59
"to"  0.55
"is"  0.54
" that"  0.54
" with"  0.54
"when"  0.53
"on"  0.49
" final"  0.47
" When"  0.46
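
Token/logit lists like the ones above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the extremes. The sketch below shows that idea on random toy data. It assumes this dashboard uses that approach, which the page does not state. The names `top_logit_tokens`, `decoder_dir`, `unembed`, and `vocab` are hypothetical.

```python
import numpy as np

def top_logit_tokens(decoder_dir, unembed, vocab, k=10):
    """Return the k most boosted and k most suppressed tokens for one feature.

    decoder_dir: (d_model,) feature direction (assumed, not from the page)
    unembed:     (d_model, vocab_size) unembedding matrix (assumed)
    """
    logits = decoder_dir @ unembed          # one logit effect per vocabulary token
    order = np.argsort(logits)              # ascending order of logit effect
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy data only; a real dashboard would use the model's actual weights.
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50
vocab = [f"tok{i}" for i in range(vocab_size)]
decoder_dir = rng.normal(size=d_model)
unembed = rng.normal(size=(d_model, vocab_size))

positive, negative = top_logit_tokens(decoder_dir, unembed, vocab, k=5)
```

Here `positive` is sorted from the largest logit effect downward and `negative` from the smallest upward, matching the ordering of the two lists on the page.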

Activations Density 0.003%
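
To put the density figure in context, a rough sketch: activation density is usually the fraction of token positions on which the feature fires. The code below assumes the density was measured over the dashboard prompt set listed under Configuration (238,145 prompts of 512 tokens each), which the page does not confirm. Since the displayed 0.003% is rounded, the implied count is only approximate. The function name `activation_density` is hypothetical.

```python
def activation_density(activations):
    """Fraction of token positions where the feature activation is positive."""
    total = len(activations)
    active = sum(1 for a in activations if a > 0)
    return active / total if total else 0.0

# Figures from the page; treating them as a single corpus is an assumption.
prompts = 238_145
tokens_per_prompt = 512
total_tokens = prompts * tokens_per_prompt

density = 0.003 / 100                      # 0.003% expressed as a fraction
implied_active_tokens = round(total_tokens * density)

print(f"total tokens: {total_tokens:,}")
print(f"implied active tokens: about {implied_active_tokens:,}")
```

Under these assumptions the feature fires on only a few thousand of roughly 122 million token positions, which is consistent with a very sparse feature.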