Explanations

refusing harmful requests as helpful AI

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

"regator"  0.45
"レスト"  0.44
"ງ"  0.38
"้ง"  0.38
"мал"  0.38
" বন্ধুদের"  0.37
" trio"  0.37
":..."  0.37
"শোর"  0.36
"ковый"  0.36

Positive Logits

"Narc"  0.43
" বিজ্ঞানী"  0.43
" Narc"  0.42
"ABP"  0.42
"ళ్ళు"  0.41
"alt"  0.39
" narc"  0.38
" narciss"  0.38
"AMPLE"  0.37
" narcissistic"  0.36

Activations Density: 0.006%