Explanations

hostile actions and targets

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" इश्"  0.33
" энергия"  0.33
"Ů"  0.32
" privind"  0.30
"﨑"  0.30
"在日本"  0.29
"Ł"  0.29
"眼神"  0.29
"誇"  0.29
"ør"  0.28

Positive Logits

" unsuspecting"  0.51
"the"  0.46
"him"  0.45
" them"  0.41
" someone"  0.38
" whoever"  0.38
" direttamente"  0.38
" anyone"  0.36
" heavily"  0.36
" directamente"  0.35

Activations Density 0.072%