Explanations

affirmative action, discrimination, social justice

The marked tokens appear in contexts where the AI model is producing content related to sensitive, controversial, or potentially harmful topics. The pattern includes: (1) content discussing gender ideology, biological sex, and traditional gender roles in critical or conservative framing; (2) instances where the model discusses harmful ideologies like white nationalism or discriminatory views; (3) sexually explicit requests and the model's refusal responses; (4) controversial political or social topics like affirmative action, vaccination mandates, conspiracy theories, and extreme scenarios; (5) tokens that are part of phrases describing discriminatory actions, harmful beliefs, or problematic characterizations of groups. The markers frequently highlight language describing discrimination, harmful stereotypes, ideological positions that challenge progressive consensus views, or content the model is programmed to refuse.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

 sottoline    0.46
 சமய    0.45
Explore    0.43
 може    0.43
 टक    0.43
臺灣    0.42
 होला    0.42
 intervalo    0.42
 destaca    0.42
 पूर्णा    0.42

Positive Logits

 Marxist    0.84
 socialist    0.78
 leftist    0.74
 якобы    0.70
 allegedly    0.68
 indoctr    0.65
 socialists    0.63
 Communist    0.63
 Socialist    0.60
 Democrat    0.58

Activations Density 0.150%
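
To give the density figure a sense of scale, the sketch below converts it into an approximate token count. It rests on two assumptions that the page does not state: that every dashboard prompt is exactly 512 tokens long, and that the density is the fraction of those tokens on which the feature activates. The constant and function names are illustrative only.

```python
# Figures copied from the dashboard configuration above.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.150 / 100  # 0.150% expressed as a fraction


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total tokens, assuming every prompt has the same length."""
    return num_prompts * tokens_per_prompt


def expected_active_tokens(total: int, density: float) -> float:
    """Approximate token count on which the feature fires (assumed definition)."""
    return total * density


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = expected_active_tokens(total, ACTIVATION_DENSITY)

print(f"total tokens: {total:,}")
print(f"approx. activating tokens: {active:,.0f}")
```

Under those assumptions the feature would fire on roughly 180 thousand of about 122 million tokens, which fits a sparse feature tied to a narrow topic.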