Explanations

probability of null or rejection

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token       Logit
" sever"    -0.11
" Heller"   -0.10
"fi"        -0.09
"_vlog"     -0.09
"fug"       -0.09
"sub"       -0.09
"fro"       -0.09
" blow"     -0.09
" Moor"     -0.09
" incom"    -0.08

Positive Logits

Token          Logit
" null"        0.21
" Null"        0.20
"Null"         0.19
"null"         0.18
"Reject"       0.17
" Reject"      0.17
" reject"      0.16
"_null"        0.16
" rejecting"   0.15
" rejection"   0.15

Activations Density 0.010%
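
To give a sense of scale for the density figure, here is a rough back-of-the-envelope sketch. It assumes (this is not stated on the page) that the density is measured over the same dashboard prompt set listed under Configuration, and that density means the fraction of token positions where the feature fires. The variable names are illustrative only.

```python
# Rough estimate of how many token positions the feature fires on.
# ASSUMPTION: density is computed over the dashboard prompt set
# (10,000 prompts x 128 tokens); the page does not confirm this.

num_prompts = 10_000        # from "Prompts (Dashboard)"
tokens_per_prompt = 128     # from "Prompts (Dashboard)"
density_percent = 0.010     # from "Activations Density 0.010%"

total_tokens = num_prompts * tokens_per_prompt

# Convert the percentage to a count and round to the nearest token.
expected_active = round(total_tokens * density_percent / 100)

print(f"total tokens:      {total_tokens:,}")
print(f"estimated firings: {expected_active:,}")
```

Under those assumptions the feature would be active on roughly 128 of about 1.28 million token positions, which is consistent with a sparse, narrowly targeted feature.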