Explanations

prevent, protect, limit

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token             Logit
" amel"           -0.13
"hur"             -0.10
" combating"      -0.10
" reconc"         -0.09
"é¾"              -0.09
" restraining"    -0.09
"OrNull"          -0.09
" refr"           -0.09
"ải"              -0.09
"戒"              -0.09

Positive Logits

Token             Logit
" prevent"        0.24
" protect"        0.23
" protecting"     0.20
" prevented"      0.19
" Protect"        0.18
" prevents"       0.18
"保护"            0.18
" limit"          0.17
" preventing"     0.17
" protects"       0.17

Activations Density 0.070%
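
To put the density figure in context, the arithmetic below estimates how many token positions would activate this feature across the dashboard prompts. This is a minimal sketch under two assumptions not stated on the page: that density means the fraction of token positions with a nonzero activation, and that the 10,000 prompts of 128 tokens each are the sample it applies to.

```python
# Assumption: density = (token positions with nonzero activation) / (all token positions).
# Assumption: the dashboard's 10,000 prompts x 128 tokens is the relevant sample.
num_prompts = 10_000
tokens_per_prompt = 128
density = 0.070 / 100  # 0.070% expressed as a fraction

total_tokens = num_prompts * tokens_per_prompt
expected_active = round(total_tokens * density)

print(total_tokens)     # 1280000
print(expected_active)  # 896
```

Under those assumptions the feature would fire on roughly 900 of 1.28 million token positions, which is consistent with a sparse feature tied to a narrow family of words such as "prevent" and "protect".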