Explanations

references to discrimination based on race, religion, and other identity markers

Negative Logits

Token          Logit
" collapses"   -0.70
" tabl"        -0.68
"merce"        -0.65
" Administ"    -0.60
" cheat"       -0.60
"stress"       -0.59
" Finder"      -0.59
"ologue"       -0.58
"cheat"        -0.58
"••"           -0.57

Positive Logits

Token          Logit
" Race"        0.72
"Gender"       0.72
" Gender"      0.71
"Race"         0.68
"loc"          0.67
" gender"      0.67
"ku"           0.67
"alore"        0.64
"imar"         0.64
"lation"       0.64

Activations Density 0.107%