Explanations

corruption and bribery

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

Token                      Logit
"LTR"                      -0.09
" fores"                   -0.09
"〕" (raw: ãĢķ)            -0.09
"ugi"                      -0.09
"anka"                     -0.09
" hoax"                    -0.09
"责" (raw: è´£)            -0.09
"_warnings"                -0.09
"pun"                      -0.09

Positive Logits

Token                      Logit
" corruption"              0.43
" corrupt"                 0.40
" Corruption"              0.36
"bri"                      0.35
" corrupted"               0.31
"腐" (raw: èħĲ)            0.31
" graft"                   0.30
"Bri"                      0.29
" brib"                    0.28
" bribery"                 0.27
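
Some tokens in the logit lists (for example èħĲ, è´£ and ãĢķ) appear in a raw byte-level form rather than as readable characters. The sketch below shows one way to decode such strings, assuming the model's tokenizer uses a GPT-2 style byte-to-unicode mapping; the dashboard does not state which tokenizer is in use, so that is an assumption. The helper name `decode_token` is hypothetical.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table.

    Printable bytes map to themselves; every other byte is shifted
    to a code point at or above 256 so that it has a visible glyph.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(raw: str) -> str:
    """Map each displayed character back to its byte, then decode as UTF-8.

    Assumption: `raw` is a byte-level token string; characters outside
    the table (e.g. already-decoded text) raise KeyError.
    """
    data = bytes(UNICODE_TO_BYTE[ch] for ch in raw)
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for raw in ["èħĲ", "è´£", "ãĢķ"]:
        print(repr(raw), "->", repr(decode_token(raw)))
```

Under that assumption, èħĲ decodes to 腐, the first character of 腐败 ("corruption"), which is consistent with the feature's explanation.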

Activations Density 0.073%
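
To put the density figure in context, the short calculation below estimates how many token positions it would correspond to. It assumes that density means the fraction of tokens on which the feature is active and that it was measured over the dashboard prompt set listed under Configuration; neither is stated on this page.

```python
# Values taken from the Configuration section and the density line.
n_prompts = 10_000
tokens_per_prompt = 128
density = 0.073 / 100  # 0.073% expressed as a fraction

# Assumption: density = active tokens / total tokens over these prompts.
total_tokens = n_prompts * tokens_per_prompt
expected_active = density * total_tokens

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {round(expected_active):,}")
```

Under these assumptions the feature would fire on roughly 900 of the 1.28 million token positions.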