Explanations

rape and harassment

Configuration

Prompts (Dashboard)

16,384 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token             Logit
" violating"      -0.78
" harassing"      -0.76
" Viol"           -0.73
" baptism"        -0.73
" violated"       -0.71
" publiques"      -0.70
"SharedDtor"      -0.69
" démocr"         -0.68
" intimidate"     -0.66
" intimidating"   -0.66

Positive Logits

Token             Logit
" control"        0.52
" agents"         0.52
"Autoritní"       0.51
"NUMX"            0.51
" controlled"     0.49
" agent"          0.47
" open"           0.47
"styles"          0.47
" minute"         0.46
"ERICA"           0.46

Activations Density 0.044%
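
The density figure can be related to the prompt configuration listed on this page. The following is a minimal sketch, assuming that density means the fraction of token positions where the feature's activation is greater than zero; the page does not state the exact definition. The function name `activation_density` and the toy data are hypothetical and for illustration only.

```python
def activation_density(acts):
    """Return the fraction of token positions with activation > 0.

    `acts` is a list of per-prompt lists of activation values.
    Assumption: a position counts as active when its value is > 0.
    """
    total = sum(len(row) for row in acts)
    active = sum(1 for row in acts for a in row if a > 0)
    return active / total


# Toy data (hypothetical): 4 prompts of 8 tokens, 2 active positions.
toy = [[0.0] * 8 for _ in range(4)]
toy[0][3] = 1.2
toy[2][5] = 0.4
density = activation_density(toy)

# Dashboard configuration from this page: 16,384 prompts, 128 tokens each.
N_PROMPTS = 16_384
TOKENS_PER_PROMPT = 128
total_tokens = N_PROMPTS * TOKENS_PER_PROMPT

# Rough number of active token positions implied by a 0.044% density,
# under the assumed definition above.
approx_active = round(total_tokens * 0.044 / 100)

print(f"toy density: {density:.4%}")
print(f"total tokens: {total_tokens:,}")
print(f"approx. active positions: {approx_active}")
```

Under this assumed definition, a density of 0.044% over the dashboard prompts corresponds to roughly nine hundred active token positions out of about two million, so the feature fires rarely.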