Explanations

monogamy or poly relationships

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token              Logit
" heterosexual"    -0.13
" homosexuality"   -0.12
" homosexual"      -0.12
" marrying"        -0.11
" homosex"         -0.10
" marry"           -0.10
" homosexuals"     -0.10
" homophobic"      -0.10
" corpus"          -0.09
" Indo"            -0.09

Positive Logits

Token          Logit
" poly"        0.29
" Poly"        0.27
"Poly"         0.22
"poly"         0.22
"(poly"        0.18
"_poly"        0.18
".poly"        0.17
"mon"          0.16
" jealousy"    0.15
" veto"        0.14

Activations Density 0.079%
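
As a rough sanity check on the density figure, the sketch below estimates how many tokens the feature fires on. It assumes that "Activations Density" means the fraction of dashboard tokens with a nonzero activation, and that the density was measured over the full 10,000 x 128 token dashboard set; the page states neither, so treat the result as an estimate only. All variable names are illustrative.

```python
# Assumption: density = (tokens with nonzero activation) / (total dashboard tokens).
n_prompts = 10_000          # from "Prompts (Dashboard)"
tokens_per_prompt = 128     # from "Prompts (Dashboard)"
density_percent = 0.079     # from "Activations Density"

total_tokens = n_prompts * tokens_per_prompt
expected_active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {expected_active_tokens:.0f}")
```

Under these assumptions the feature would be active on roughly a thousand of the 1.28 million dashboard tokens, which is consistent with a sparse, topic-specific feature.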