Explanations

price, work, gay, home, gang, shaped, pressing

The neuron strongly activates on tokens naming or describing extremist ideology or hate‐group labels (e.g. “neo‐Nazi,” “white supremacist”).

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token  Logit
"呆"  -0.93
"ferous"  -0.93
" assez"  -0.92
"át"  -0.88
"вен"  -0.86
"叭"  -0.85
"own"  -0.84
"ebaran"  -0.83
"ที่มา"  -0.83
" और"  -0.82

Positive Logits

Token  Logit
" peinado"  1.05
" vuonna"  1.01
"ükü"  0.96
"imaginary"  0.96
"esas"  0.92
" salvajes"  0.91
"piv"  0.90
"Imaginary"  0.89
"™,"  0.89
" کے"  0.89

Activation Density: 0.117%
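
For a sense of scale, the density figure can be converted into an approximate token count. The sketch below assumes that activation density means the fraction of dashboard tokens on which the neuron fires, and that it was measured over the full set of 24,576 prompts of 128 tokens each; neither assumption is confirmed by the page itself, and the function name is hypothetical.

```python
def estimated_active_tokens(n_prompts: int, tokens_per_prompt: int,
                            density_percent: float) -> tuple[int, float]:
    """Return (total tokens, estimated number of tokens with a nonzero activation).

    Assumption: density_percent is the share of all dashboard tokens
    on which the neuron activates.
    """
    total = n_prompts * tokens_per_prompt
    active = total * density_percent / 100.0
    return total, active


# Values taken from the configuration and density shown on the dashboard.
total, active = estimated_active_tokens(24_576, 128, 0.117)
print(f"total tokens: {total:,}")
print(f"estimated active tokens: {active:,.0f}")
```

Under these assumptions the neuron fires on roughly 3,700 of about 3.1 million tokens, which is consistent with a sparse, narrowly selective unit.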