Explanations

offensive language and content

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token       Logit
" ance"     -0.10
"ãģ¦"       -0.09
"ãĢħ"       -0.09
"worth"     -0.09
"fort"      -0.09
"uter"      -0.09
"thread"    -0.09
"erver"     -0.09
"mon"       -0.08

Positive Logits

Token         Logit
"ensively"    0.13
" lineman"    0.13
"/off"        0.12
"ensive"      0.11
"-language"   0.11
"iveness"     0.11
"hand"        0.10
"rant"        0.10
"åªĴ"         0.10
"beat"        0.10

Activations Density: 0.022%
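
To put the density figure in perspective, the short sketch below converts it into an approximate token count. It assumes, and the page does not state this, that the density is the fraction of tokens with a nonzero activation and that it was measured over the dashboard prompts listed under Configuration. The function name is hypothetical and used only for illustration.

```python
def approx_active_tokens(n_prompts: int, tokens_per_prompt: int, density_percent: float):
    """Return (total tokens, approximate number of tokens where the feature fires).

    Assumption: density_percent is the share of all tokens with a nonzero
    activation, measured over exactly these prompts.
    """
    total_tokens = n_prompts * tokens_per_prompt
    active_tokens = total_tokens * density_percent / 100.0
    return total_tokens, active_tokens


# Values taken from the Configuration and Activations Density lines.
total, active = approx_active_tokens(10_000, 128, 0.022)
print(f"{total:,} tokens in total, roughly {round(active)} with a nonzero activation")
```

Under those assumptions the feature fires on only a few hundred of the 1.28 million sampled tokens, which fits a narrow feature such as the "offensive language and content" explanation suggests.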