Reference genomes — and who's missing from them
In Module 3 you learned how sequencing works: you fragment DNA, sequence billions of short reads, and get back a FASTQ file. But then what? A FASTQ file is just a list of short sequences — it doesn't tell you where in the genome each read came from.
To make sense of those reads, you need to align them against a reference genome: a master sequence that acts as a coordinate system for the entire human genome. Every variant you've ever read about — in a 23andMe report, a clinical genetics paper, a ClinVar entry — is defined relative to a reference genome.
Here's the problem: that reference genome was built from a tiny, unrepresentative group of people. And the consequences for medicine are worse than most genomics textbooks admit.
By the end of this module you should be able to answer:
- What is a reference genome and why do we need one?
- How was GRCh38 built, and from whose DNA?
- Who is systematically underrepresented in the reference, and why does that matter?
- What is reference bias, and how does it create diagnostic inequity?
- What is the Human Pangenome Reference Consortium, and what problem is it solving?
---
What a reference genome actually is
A reference genome is not the genome of any specific person. It's a mosaic: a single linear sequence stitched together from multiple donors. At any given position it carries whichever base the contributing donor had, which is usually, but not always, the most common base in the human population.
Think of it like building a "standard English" dictionary from the speech of a small group of people, then using that dictionary to evaluate whether anyone else is speaking "correctly." If the group you sampled is not representative, the dictionary will mark normal variation in other groups as errors.
The current reference genome is called GRCh38 (Genome Reference Consortium human build 38), also known as hg38. It is 3.1 billion base pairs long and covers about 92% of the human genome — the rest, largely centromeres and highly repetitive regions, wasn't resolved until 2022.
When a sequencing lab aligns your reads to GRCh38, every position in your genome gets compared to the reference. Wherever you differ from the reference, that position gets flagged as a variant. The reference is literally the definition of "normal" in genomics.
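To make "differs from the reference" concrete, here is a minimal sketch. It assumes the sample has already been aligned and has the same length as the reference, which real pipelines cannot assume. The sequences and the function name `call_variants` are invented for illustration.

```python
# Toy illustration: a "variant" is any position where the sample differs
# from the reference. Both sequences are invented and pre-aligned.

def call_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length, pre-aligned sequences")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
print(call_variants(reference, sample))  # [(5, 'A', 'T'), (10, 'C', 'A')]
```

Note that the output depends entirely on the reference: swap in a different reference sequence and the same sample yields a different list of "variants."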
---
How GRCh38 was built — and from whose DNA
The original human reference genome was completed in 2003 as part of the Human Genome Project. It was assembled from DNA donated by approximately 20 volunteers recruited in Buffalo, New York and other cities.
One donor contributed the majority of the sequence, roughly 70% of the final assembly. That donor is known in genomics literature as RP11, an anonymized identifier. RP11 was male; because donors were anonymized, his ancestry is known only from inference on the sequence itself.
The Human Genome Project made deliberate choices to protect donor anonymity, which is understandable. But the effect was a reference genome that overrepresents one genetic background above all others.
GRCh38, released in 2013, improved on earlier builds by incorporating additional sequences and fixing assembly errors. But its core ancestry composition remained the same: predominantly European, with small contributions from East Asian, African American, and other backgrounds. The exact proportions aren't publicly disclosed, but analyses of the reference's variant frequency patterns consistently confirm European overrepresentation.
By the numbers:
- ~70% of GRCh38's primary assembly traces to a single European-ancestry donor (RP11)
- African populations, which harbor more human genetic diversity than any other continental group, contribute only a small fraction of the reference
- Variants common in African, South Asian, Indigenous American, and Pacific Islander populations are systematically underrepresented
This isn't a conspiracy — it reflects who was available, willing, and recruited in the late 1990s when the Human Genome Project was building its sample pool. But the downstream effects are significant.
---
What reference bias actually does
When a sequencing pipeline aligns your reads to GRCh38, it's doing a probabilistic match: it finds the location in the reference where each read fits best. This process is called read mapping or alignment, and it has a systematic flaw called reference bias.
Here's how reference bias works:
- You sequence a person of West African ancestry
- Their reads include variants common in West African populations but rare in European populations
- The aligner struggles to place reads containing these variants — the mismatch from the reference is high
- Those reads get lower mapping quality scores, or in some cases, fail to align at all
- Those genomic positions end up with lower coverage
- The variant caller either misses the variant entirely or flags it with low confidence
The result: genomic analysis is systematically less accurate for people whose ancestry diverges most from the reference — which in practice means people of African, Indigenous, and admixed heritage.
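The steps above can be sketched with a toy read mapper. This is not how production aligners work (they use indexes and probabilistic scoring); it is a brute-force stand-in that slides each read along the reference and rejects any placement with too many mismatches. The sequences, the function names, and the `MAX_MISMATCHES` cutoff are all hypothetical.

```python
# Toy read mapper: try every offset, keep the one with the fewest mismatches,
# and report the read as unmapped if even the best placement is too divergent.

MAX_MISMATCHES = 1  # hypothetical cutoff, chosen only for illustration

def count_mismatches(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def map_read(reference: str, read: str, max_mismatches: int = MAX_MISMATCHES):
    """Return (offset, mismatches) for the best placement, or None if unmapped."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = count_mismatches(window, read)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    if best is None or best[1] > max_mismatches:
        return None
    return best

reference = "TTGACCGATTACAGGCTA"
reads = {
    "matches reference": "GATTACAG",
    "one variant": "GATTGCAG",
    "three variants": "GCTTGCTG",
}
for label, read in reads.items():
    print(label, "->", map_read(reference, read))
```

All three reads come from the same locus. The read carrying three differences is the one that drops out, so the positions it covers lose coverage, which is the mechanism behind the missed or low-confidence variant calls described above.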
This has real clinical consequences:
- Missed pathogenic variants: A mutation causing disease in an African-ancestry patient may sit in a region where alignment quality is poor, leading to a false negative in clinical sequencing
- Variants of Uncertain Significance (VUS) inflation: Variants that are common and benign in non-European populations get flagged as "uncertain" because the databases built on reference-aligned sequences don't have enough representation to know they're benign
- Polygenic risk score failure: Polygenic risk scores (PRS), which predict disease risk by adding up the effects of thousands of small-effect variants, were developed predominantly in European-ancestry genome-wide association studies (GWAS). They perform significantly worse in non-European populations, sometimes failing entirely
A 2019 paper in Nature Communications by Duncan et al. found that polygenic risk scores perform substantially worse in non-European populations than in Europeans, with the largest drop in African-ancestry samples. The main driver is not biological difference but data collection choices made decades ago: the underlying studies sampled mostly European-ancestry participants.
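For readers who have not seen one computed, a polygenic risk score is commonly calculated as a weighted sum: each variant's effect size (estimated in a GWAS) multiplied by how many copies of the risk allele the person carries. The sketch below assumes that simple additive form. The variant IDs, effect sizes, and the choice to treat missing genotypes as zero copies are illustrative assumptions, not taken from any published score.

```python
# Minimal additive PRS: sum over variants of (effect size x risk-allele dosage).

def polygenic_risk_score(effect_sizes: dict[str, float], dosages: dict[str, int]) -> float:
    """Dosage is 0, 1, or 2 copies of the risk allele; missing variants count as 0 here."""
    return sum(beta * dosages.get(variant_id, 0) for variant_id, beta in effect_sizes.items())

# Hypothetical effect sizes, as if estimated in a single-ancestry GWAS.
effect_sizes = {"var_a": 0.5, "var_b": 0.25, "var_c": -0.125}
# Hypothetical genotype for one person.
dosages = {"var_a": 2, "var_b": 1, "var_c": 0}

print(polygenic_risk_score(effect_sizes, dosages))  # 1.25
```

The weights are the weak point: if they were estimated in one population, nothing in the formula corrects them for another.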
---
The diversity gap in genomics databases
The reference genome is only one layer of the problem. The databases used to interpret variants — ClinVar, gnomAD, the GWAS Catalog — also skew heavily European.
gnomAD v4 (2023), the largest public variant database, contains sequencing data from approximately 807,000 individuals (about 730,000 exomes and 76,000 genomes). The breakdown by genetic ancestry group:
- ~73% European (non-Finnish)
- ~6% South Asian
- ~5% African/African American
- ~4% Finnish European
- ~4% Latino/Admixed American
- ~3% East Asian
- ~2% Ashkenazi Jewish
- <1% Middle Eastern
- ~4% remaining individuals not assigned to a group
African populations are particularly underrepresented relative to their share of global genetic diversity. Africa contains more human genetic variation than all other continents combined, yet African-ancestry individuals make up only about 5% of the largest public variant database.
This matters because when a clinician looks up a rare variant found in a patient, they check databases like gnomAD to see how common that variant is. If the variant is common in the patient's ancestral population but underrepresented in gnomAD, it may look like a rare, potentially pathogenic mutation when it is actually benign. This is a major driver of VUS inflation in patients of non-European ancestry.
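A small worked example shows how a pooled frequency can mislead. The counts below are hypothetical, as are the group names and function names; the point is only the arithmetic. A variant that is fairly common in a sparsely sampled group looks extremely rare once its counts are pooled with a much larger, well-sampled group.

```python
# Allele frequency = alternate allele count / total alleles sampled.
# Pooling groups of very different sizes lets the largest group dominate.

def allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Fraction of sampled chromosomes carrying the alternate allele."""
    return alt_count / total_alleles if total_alleles else 0.0

def frequencies_by_group(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Per-group frequencies plus an 'overall' frequency from the pooled counts."""
    freqs = {group: allele_frequency(ac, an) for group, (ac, an) in counts.items()}
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    freqs["overall"] = allele_frequency(total_ac, total_an)
    return freqs

# Hypothetical (alt allele count, total alleles) per group.
counts = {
    "well_sampled_group": (2, 1_000_000),
    "sparsely_sampled_group": (40, 10_000),
}
for group, af in frequencies_by_group(counts).items():
    print(f"{group}: {af:.6f}")
```

Here the pooled frequency is roughly 0.004%, while the frequency in the sparsely sampled group is 0.4%, about a hundred times higher. Checking the group-specific frequency, not just the overall one, is what separates the two readings.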
---
The pangenome: the fix that's now in progress
The problem has been known in the genomics community for over a decade. The solution — building a reference that represents human diversity rather than a single ancestral background — took until 2023 to materialize at scale.
In May 2023, the Human Pangenome Reference Consortium (HPRC) published the first human pangenome reference in Nature. Instead of a single linear sequence, the pangenome is a graph: a data structure that encodes multiple possible paths through the genome, representing variation across 47 diverse individuals chosen specifically for ancestral diversity.
What the pangenome does differently:
- Represents 47 haplotype-resolved genomes from individuals of African, American, Asian, and European ancestry
- Contains 119 million base pairs of sequence not present in GRCh38
- Identifies 1,115 gene duplications missed by the linear reference
- Reduces reference bias in alignment — reads from non-European-ancestry individuals align more accurately to the graph
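The graph idea can be illustrated with a toy example. This sketch assumes a tiny hand-built graph with one site of variation. The node sequences, the dictionary layout, and the function names are hypothetical and far simpler than the formats real graph aligners use.

```python
# Toy sequence graph: nodes hold sequence, edges say which node can follow which.
# Each path from the start node to a sink spells out one possible haplotype.

nodes = {1: "GATT", 2: "A", 3: "G", 4: "CAGG"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}  # nodes 2 and 3 are alternative alleles

def spell_paths(nodes, edges, start):
    """Yield (node_ids, sequence) for every path from start to a sink node."""
    stack = [([start], nodes[start])]
    while stack:
        path, seq = stack.pop()
        successors = edges[path[-1]]
        if not successors:
            yield path, seq
        for nxt in successors:
            stack.append((path + [nxt], seq + nodes[nxt]))

def paths_containing(read, nodes, edges, start=1):
    """Return the paths (as tuples of node ids) whose sequence contains the read exactly."""
    return sorted(tuple(path) for path, seq in spell_paths(nodes, edges, start) if read in seq)

linear_reference = "GATTACAGG"  # the single path 1 -> 2 -> 4
read = "TTGCA"                  # carries the allele stored in node 3

print(read in linear_reference)              # False
print(paths_containing(read, nodes, edges))  # [(1, 3, 4)]
```

The read has no exact match in the linear sequence but matches a path in the graph with zero mismatches, because the alternative allele is part of the reference structure instead of being a deviation from it.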
The pangenome is not yet in routine clinical use — the software pipelines (called graph aligners) are more computationally demanding than linear alignment, and clinical labs haven't fully transitioned. But it represents the most significant shift in genomic infrastructure since GRCh38.
The T2T connection: You may remember from Module 3 that the Telomere-to-Telomere (T2T) Consortium completed the first truly gapless human genome in 2022, adding 200 million base pairs of previously unresolved sequence. The HPRC pangenome incorporates T2T methods, using long-read sequencing to assemble regions that short reads couldn't resolve. The two projects together represent a wholesale upgrade of genomic infrastructure.
---
Why this is a policy problem, not just a science problem
The reference genome was built largely with public funding: the Human Genome Project was funded chiefly by the NIH and the US Department of Energy, with major support from the UK's Wellcome Trust. The databases used to interpret genomic variants (ClinVar, dbSNP, gnomAD) are publicly funded. And the clinical decisions made using these tools (whether to call a variant pathogenic, whether to recommend a genetic test, whether to prescribe a drug) affect patients from every background.
The diversity gap in genomics is therefore not just a scientific limitation. It is a publicly funded inequity: resources were allocated to build infrastructure that serves some patients better than others, and the gap has compounded for two decades because the field didn't prioritize fixing it.
Several policy levers are now in motion:
- NIH's All of Us Research Program is collecting genomic data from 1 million+ Americans with explicit diversity targets — 50%+ from groups underrepresented in biomedical research
- The HPRC pangenome was funded by NHGRI (National Human Genome Research Institute) and reflects deliberate policy about representational diversity in reference infrastructure
- Some researchers argue for ancestry-stratified polygenic risk scores as an interim fix while databases build up diversity
- Others argue the entire paradigm of comparing individuals to a "normal" reference is flawed and should be replaced by graph-based population-specific analysis
The debate is unresolved. But the policy stakes — who gets accurate genetic diagnoses, who gets actionable polygenic risk scores, who benefits from precision medicine — are enormous.
---
Check yourself
These questions test whether you actually understood the module, not just whether you read it.
1. A bioinformatics pipeline maps reads from a Yoruba (West African) patient's genome to GRCh38. The reads containing a particular variant consistently receive low mapping quality scores. What is the most likely explanation? What clinical consequence could follow?
2. A clinician orders whole-genome sequencing for a South Asian patient with a suspected rare disease. The lab finds a variant and queries gnomAD, where its overall allele frequency is 0.003%. The clinician is concerned it may be pathogenic. What is a critical question you should ask before drawing that conclusion?
3. What is the core structural difference between the human pangenome and GRCh38? Why does that difference reduce reference bias?
4. You are advising an NIH committee on genomic equity policy. They ask whether the All of Us Research Program alone solves the reference bias problem. What would you say?
---
Key facts to remember
- GRCh38 is a linear mosaic genome; ~70% of its primary assembly traces to a single European-ancestry donor (RP11)
- Reference bias causes systematically lower alignment accuracy for reads from individuals whose ancestry diverges from the reference
- VUS inflation disproportionately affects non-European patients because benign variants common in their populations are underrepresented in interpretation databases
- gnomAD v4 contains roughly 807,000 individuals (about 730,000 of them exomes), but African populations remain underrepresented relative to their share of global genetic diversity
- The HPRC pangenome (2023) represents 47 diverse individuals as a graph structure, reducing reference bias and adding 119M base pairs absent from GRCh38
- This is a policy problem: publicly funded genomic infrastructure was built in ways that serve some patients better than others
---
Primary sources & references
- Nurk, S. et al. (2022). "The complete sequence of a human genome." Science, 376, 44–53.
- Liao, W-W. et al. (2023). "A draft human pangenome reference." Nature, 617, 312–324.
- Duncan, L. et al. (2019). "Analysis of polygenic risk score usage and performance in diverse human populations." Nature Communications, 10, 3328.
- Popejoy, A. B. & Fullerton, S. M. (2016). "Genomics is failing on diversity." Nature, 538, 161–164.
- Collins, F. S. & Varmus, H. (2015). "A new initiative on precision medicine." NEJM, 372, 793–795.
- Chen, S. et al. (2024). "A genomic mutational constraint map using variation in 76,156 human genomes." Nature, 625, 92–100. (gnomAD; based on the v3 genome release)