Explanations
references to discrimination based on race, religion, and other identity markers
Negative Logits
Token       Logit
collapses   -0.70
tabl        -0.68
merce       -0.65
Administ    -0.60
cheat       -0.60
stress      -0.59
Finder      -0.59
ologue      -0.58
cheat       -0.58
••          -0.57
Positive Logits
Token       Logit
Race        0.72
Gender      0.72
Gender      0.71
Race        0.68
loc         0.67
gender      0.67
ku          0.67
alore       0.64
imar        0.64
lation      0.64
Activations Density 0.107%