INDEX

Explanations

hate speech example

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

Token          Logit
" Fraud"       -0.09
"fak"          -0.09
" Bere"        -0.08
"_traits"      -0.08
" fraud"       -0.08
" Synd"        -0.08
"æ¬"           -0.08
"intr"         -0.08
" Ø¯Ø±Ø¬Ùĩ"     -0.08
" èŃ"          -0.08

Positive Logits

Token          Logit
" original"    0.17
" hate"        0.15
" statement"   0.13
" speech"      0.13
"original"     0.13
"(original"    0.12
" message"     0.12
" Hate"        0.12
" initial"     0.12
" argument"    0.11

Activations Density 0.058%
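
As a rough sanity check, the density figure can be related to the dashboard configuration listed above. The sketch below assumes that density means the fraction of sampled tokens on which the feature has a nonzero activation, and that it was measured over exactly the 10,000 prompts of 128 tokens each. Neither assumption is confirmed by the page, and the function name is hypothetical.

```python
# Assumption (not stated on the page): density = active tokens / total sampled tokens,
# measured over the dashboard's 10,000 prompts of 128 tokens each.
PROMPTS = 10_000
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.058


def implied_active_tokens(prompts: int, tokens_per_prompt: int, density_percent: float) -> float:
    """Estimate how many sampled tokens activate the feature, given a density in percent."""
    total = prompts * tokens_per_prompt
    return total * density_percent / 100.0


total_tokens = PROMPTS * TOKENS_PER_PROMPT
active = implied_active_tokens(PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT)
print(f"total tokens: {total_tokens}, implied active tokens: about {round(active)}")
```

Under these assumptions the feature would fire on roughly 740 of the 1,280,000 sampled tokens, which fits a sparse feature tied to a narrow topic.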