INDEX

Explanations

"I" followed by refusal or acknowledgment

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"お願いします" (Japanese: "please")    0.67
"etc"    0.63
"即可" (Chinese: "is sufficient")    0.61
">();"    0.60
" respective"    0.60
" blocking"    0.60
"if"    0.59
".}"    0.58
" jika" (Indonesian: "if")    0.58
" nearby"    0.58

Positive Logits

" admittedly"    1.00
" realize"    0.87
" realized"    0.82
" Admittedly"    0.82
" acknowledge"    0.82
" понимаю" (Russian: "I understand")    0.81
"mittedly"    0.81
" embarked"    0.81
" recognize"    0.80
"之所以" (Chinese: "the reason why")    0.80

Activations Density 0.229%