Explanations

refusal and termination

refusal or rejection

Configuration

Prompts (Dashboard): 392,802 prompts, 256 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

Logit  Token
0.42   "惚"
0.40   "consol"
0.40   "ದಗ"
0.40   " oprav"
0.39   " বাড়"
0.39   " बढ़ी"
0.39   " সন্ত"
0.38   " вклю"
0.38   "account"
0.37   " ожи"

Positive Logits

Logit  Token
1.48   "拒绝"
1.46   "拒"
1.45   " rejection"
1.43   " reject"
1.42   " refusal"
1.38   " rejecting"
1.38   " refus"
1.37   " rejects"
1.37   " rifi"
1.37   " رفض"

Activations Density: 0.806%
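
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the fraction of token positions on which the feature's activation is above zero. This is an assumption about the metric, not a description of how the dashboard computes it. The function name `activation_density`, the `threshold` parameter, and the toy activation values are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    `activations` is a list of prompts, each a list of per-token activation
    values. The "active if > threshold" rule is an assumed definition.
    """
    total = 0
    active = 0
    for prompt in activations:
        for value in prompt:
            total += 1
            if value > threshold:
                active += 1
    # An empty input has no tokens, so report a density of zero.
    return active / total if total else 0.0


# Hypothetical toy data: 2 prompts of 4 tokens each, one active token.
toy = [
    [0.0, 0.0, 1.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(f"{activation_density(toy):.3%}")  # prints 12.500%
```

Under this definition, a density of 0.806% would mean the feature is active on fewer than one in a hundred token positions in the dashboard prompts.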