Explanations

based on protected characteristics

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

(Tokens are quoted so that leading spaces are visible.)

"ingle"        -0.13
" Weston"      -0.09
"kie"          -0.09
"icana"        -0.09
"ãİ"           -0.08
" Grim"        -0.08
" impartial"   -0.08
"iset"         -0.08
" Hers"        -0.08
"TextEdit"     -0.08

Positive Logits

(Tokens are quoted so that leading spaces are visible.)

" race"        0.19
"race"         0.15
" Race"        0.13
" grounds"     0.13
" their"       0.13
" gender"      0.12
" protected"   0.12
" skin"        0.12
"grounds"      0.12
"Race"         0.12

Activations Density 0.038%