Explanations

defamation and malicious actions

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token         Logit
"<0xCF>"      0.45
" индика"     0.44
"rique"       0.43
"segn"        0.41
" inesper"    0.41
"betrieb"     0.40
" Конечно"    0.40
"电阻"        0.40
" 편리"       0.40
"неоп"        0.40

Positive Logits

Token             Logit
" defamation"     1.16
" defamatory"     1.09
" slander"        1.02
" libel"          0.97
"诽"              0.91
" defam"          0.89
" publication"    0.78
" publications"   0.73
"誹"              0.70
"诬"              0.68

Activations Density 0.037%
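
To give the density figure a sense of scale, the short sketch below converts it into an approximate token count. It assumes, and the page does not confirm this, that the density is the fraction of all dashboard tokens on which the feature has a nonzero activation, and that it was measured over the full prompt set listed under Configuration. The function name and constants are illustrative only.

```python
# Figures quoted from the dashboard configuration and density line.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.037  # "Activations Density 0.037%"


def estimated_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total tokens, estimated tokens with nonzero activation).

    Assumption: density is measured over every token of every prompt.
    """
    total = num_prompts * tokens_per_prompt
    return total, total * density_percent / 100.0


total_tokens, active_tokens = estimated_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {active_tokens:,.0f}")
```

Under that assumption the feature would fire on roughly 45 thousand of about 122 million tokens, which fits a narrow, topic-specific feature.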