Genomics | Module 7 of 13 | 45 min | Modules 1–6

GWAS and complex traits — why height isn't in one gene

Start here

Everything in Module 6 was about single-gene disease: one broken gene, one disease. Cystic fibrosis. Huntington's. BRCA1-associated cancer. These are called Mendelian diseases — they follow the inheritance patterns Gregor Mendel described in the 1860s.

But most human traits and diseases don't work that way. Height. Intelligence. Type 2 diabetes. Schizophrenia. Coronary artery disease. These are complex traits — influenced by thousands of genetic variants simultaneously, each contributing a tiny effect, interacting with each other and with the environment. No single gene explains them. No single variant predicts them.

The tool built to study complex traits at scale is the genome-wide association study (GWAS), one of the most productive and most misunderstood methods in modern genomics. This module covers how GWAS works, what it actually finds, why it has been criticized, and what it can and cannot tell you about biology.

By the end of this module you should be able to answer:

What is a complex trait and how does it differ from a Mendelian trait?
How does a GWAS work, at the statistical and biological level?
What is linkage disequilibrium and why does it matter for interpreting GWAS results?
What is a Manhattan plot and how do you read one?
What are the limitations of GWAS — missing heritability, winner's curse, non-European bias?
What is a polygenic score and what can it actually predict?

---

Mendelian vs. complex traits

In Mendelian disease, the genetic architecture is simple: one gene, one disease, large effect. A patient with two loss-of-function CFTR variants will have cystic fibrosis. The variant explains essentially all of the genetic risk.

Complex traits have a fundamentally different architecture. Take height: it is approximately 80% heritable, meaning about 80% of the variation in height across people in a population is attributable to genetic differences rather than environment. But that 80% is distributed across thousands of genetic variants, each explaining a tiny fraction of a percent of height variation. The common variant with the largest effect shifts height by only about 0.4 cm, which is essentially nothing on its own.

This is called polygenic architecture: many variants, each with small effect, collectively explaining a large heritable component.
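To get a feel for how small "a tiny fraction of a percent" is, here is a rough Python sketch. It relies on the standard additive-model result that a variant with allele frequency p and per-allele effect beta (in trait standard deviations) explains 2p(1-p)·beta² of trait variance. The allele frequency and effect size are illustrative assumptions, not values from any study.

```python
def variance_explained(allele_freq, beta_sd):
    """Fraction of trait variance explained by one additive variant.

    Assumes Hardy-Weinberg proportions and a phenotype scaled to variance 1,
    so beta_sd is the per-allele effect in standard deviation units.
    """
    genotype_variance = 2 * allele_freq * (1 - allele_freq)
    return genotype_variance * beta_sd ** 2


# Hypothetical common variant: 30% allele frequency, 0.02 SD per allele.
single = variance_explained(0.30, 0.02)

# How many equal, independent variants like this would add up to 80%?
variants_needed = 0.80 / single

print(f"one variant explains {single:.4%} of variance")
print(f"about {variants_needed:,.0f} such variants needed for 80%")
```

Under these assumed numbers a single variant explains under 0.02% of variance, and it takes thousands of them to reach 80%. Real variants differ in frequency and effect size, so treat this only as an order-of-magnitude illustration.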

The same is true for most common diseases:

Type 2 diabetes: hundreds of associated loci, each increasing risk by a few percent
Schizophrenia: hundreds of loci; the largest common variant association has an odds ratio of ~1.1
Coronary artery disease: 300+ associated loci
Educational attainment: 1,271 independent genome-wide significant SNPs in a 2018 GWAS of about 1.1 million people

This architecture has profound implications. You cannot identify a "type 2 diabetes gene" the way you identify a "cystic fibrosis gene." The biology is distributed, not concentrated.

---

How a GWAS works

A genome-wide association study is a statistical comparison between two groups: people with a trait or disease (cases) and people without (controls). For quantitative traits like height, you compare across a continuous distribution instead.

The basic logic:

For each of millions of SNPs across the genome, ask: is the frequency of this allele different between cases and controls? If people who carry the A allele at a given position are significantly more likely to have type 2 diabetes than people who carry the G allele — controlling for ancestry and other confounders — that SNP is associated with the disease.

Step by step:

Genotyping: Participants' DNA is genotyped on a SNP array — a chip that simultaneously measures 500,000–5,000,000 SNPs. This is much cheaper than whole-genome sequencing ($50–200 per sample vs. $1,000+).

Imputation: Most SNPs in the genome aren't on the array. Using reference panels (like the 1000 Genomes Project), researchers statistically impute the genotypes at untyped positions based on known patterns of linkage disequilibrium (more on this below). This effectively expands coverage to 10–30 million SNPs.

Quality control: Remove samples with low call rates, ancestry outliers, or cryptic relatedness. Remove SNPs with low call rates, deviation from Hardy-Weinberg equilibrium, or very low minor allele frequency.

Association testing: For each SNP, run a regression: phenotype ~ genotype + covariates (age, sex, ancestry principal components). The output is an effect size (beta or odds ratio) and a p-value.

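Multiple testing correction: You've run millions of tests. By chance alone, many will be "significant" at p < 0.05. The standard GWAS significance threshold is p < 5 × 10⁻⁸, which corresponds to a Bonferroni correction of 0.05 for roughly one million independent common-variant tests. It is chosen to keep the chance of any false positive across the whole genome at roughly 5%. Only SNPs with p-values below this threshold are reported as genome-wide significant.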

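Replication: True associations should replicate in independent cohorts. An association that repeatedly fails to replicate in adequately powered cohorts is probably a false positive. Expect replication effect sizes to be somewhat smaller than the discovery estimates: hits that cross a strict significance threshold tend to have overestimated effects, a bias known as the winner's curse.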
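The association-testing and multiple-testing steps above can be sketched for a single SNP in plain Python. This is a simplified illustration, not a production pipeline: it fits phenotype ~ genotype with no covariates, uses a normal approximation for the p-value (reasonable only for large samples), and runs on simulated data whose sample size, allele frequency, and effect size are all assumptions. The function name `snp_association` is hypothetical.

```python
import math
import random


def snp_association(genotypes, phenotypes):
    """Simple linear regression of phenotype on genotype dosage (0/1/2).

    Returns (beta, standard_error, p_value). Covariates are omitted and the
    two-sided p-value uses a normal approximation to the t distribution.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2
              for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p_value


# Bonferroni-style threshold: 0.05 spread over ~1 million independent tests.
GENOME_WIDE_THRESHOLD = 0.05 / 1_000_000

# Simulated data (hypothetical): 5,000 people, allele frequency 0.3,
# true effect of 0.2 trait units per allele plus random noise.
rng = random.Random(0)
genotypes = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(5000)]
phenotypes = [0.2 * g + rng.gauss(0, 1) for g in genotypes]

beta, se, p_value = snp_association(genotypes, phenotypes)
print(f"beta={beta:.3f}, se={se:.3f}, p={p_value:.2e}")
print("genome-wide significant:", p_value < GENOME_WIDE_THRESHOLD)
```

A real GWAS repeats this test for every SNP and includes covariates such as age, sex, and ancestry principal components, which requires a multiple-regression fit rather than the one-predictor formula used here.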

---

Linkage disequilibrium — why one signal represents many variants

A critical concept for understanding GWAS results is linkage disequilibrium (LD): the tendency for nearby variants to be inherited together more often than chance would predict.

When a new mutation arises, it occurs on a specific chromosome — embedded in a stretch of sequence with a particular set of alleles at nearby positions. Over generations, recombination breaks up these associations, but nearby variants recombine less frequently (they're less likely to be separated by a recombination event), so they stay correlated for many generations.

The practical consequence: when a GWAS finds a significant SNP, that SNP is usually not the causal variant itself — it's a tag SNP that happens to be in LD with the actual causal variant. The causal variant could be any of dozens of other SNPs in the same LD block.

r² measures LD: r² = 1 means two SNPs are perfectly correlated (they always travel together); r² = 0 means they're uncorrelated.
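As a small illustration, r² can be estimated as the squared Pearson correlation between two SNPs' genotype dosages (0, 1, or 2 copies of an allele) across individuals. This genotype-based version is a common approximation when phase is unknown; a haplotype-based r² would need phased data. The genotype vectors below are made up for the example.

```python
def ld_r2(snp_a, snp_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    return cov * cov / (var_a * var_b)


# Hypothetical genotypes for six individuals at two nearby SNPs
# that always travel together.
tag_snp = [0, 1, 2, 0, 1, 2]
causal_snp = [0, 1, 2, 0, 1, 2]
print(ld_r2(tag_snp, causal_snp))  # perfectly correlated

# Hypothetical genotypes at two unlinked SNPs.
print(ld_r2([0, 0, 2, 2], [0, 2, 0, 2]))  # uncorrelated
```

The function assumes both SNPs vary in the sample; a monomorphic SNP has zero variance and would cause a division by zero.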

This has major implications for interpretation. A GWAS "hit" identifies a genomic region harboring a causal signal — not necessarily the causal variant. Finding the actual causal variant requires fine mapping: analyzing the full set of variants in the LD block to identify which one is most likely causal, often combined with functional data (eQTL analysis, chromatin accessibility) to prioritize candidates.

---

Reading a Manhattan plot

GWAS results are displayed as a Manhattan plot: a scatter plot where:

The x-axis represents genomic position (chromosomes laid out left to right, 1–22 then X)
The y-axis represents -log₁₀(p-value) — higher means more significant
Each dot is a SNP

The name comes from the skyline-like appearance when multiple peaks reach genome-wide significance: tall towers rising above a flat baseline.
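The y-axis transform is simple enough to compute by hand. The sketch below converts p-values to plot heights and checks them against the genome-wide significance threshold of p < 5 × 10⁻⁸. The SNP positions and p-values are invented for illustration, and `manhattan_height` is a hypothetical helper name.

```python
import math

GENOME_WIDE_P = 5e-8


def manhattan_height(p_value):
    """Y-axis value on a Manhattan plot for a given p-value."""
    return -math.log10(p_value)


# Height of the genome-wide significance line, about 7.3.
threshold_height = manhattan_height(GENOME_WIDE_P)

# Hypothetical SNPs as (chromosome, position, p-value).
snps = [
    ("1", 1_200_000, 0.04),
    ("6", 31_500_000, 3e-12),
    ("10", 112_900_000, 2e-7),
]

# Keep only SNPs whose dots would sit above the significance line.
significant = [
    (chrom, pos) for chrom, pos, p in snps
    if manhattan_height(p) > threshold_height
]
print(round(threshold_height, 2), significant)
```

Note that the third SNP, at p = 2 × 10⁻⁷, looks impressive by everyday standards but still falls below the line.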

What you're looking for:

Peaks above the red line (p < 5 × 10⁻⁸): genome-wide significant associations
Width of peaks: Broader peaks indicate LD extending across a longer stretch of the chromosome; narrow peaks indicate LD that decays over a short distance
Multiple peaks on the same chromosome: Often multiple independent signals within a region, or in completely different genes

What the peaks don't tell you:

Which gene the signal affects (could be an enhancer of a distant gene)
Whether the associated SNP is causal or just a tag
The direction or magnitude of effect on biology

A typical GWAS Manhattan plot for a well-powered complex disease study might show 50–300 significant peaks. Each peak represents a genomic region — a locus — but identifying the causal gene and mechanism at each locus requires extensive follow-up work.

---

What GWAS findings usually are — and aren't

The most common misreading of GWAS results is to look at the gene nearest a significant SNP and declare it "the gene for X." This is almost always wrong in at least one of several ways.

Most GWAS SNPs are in non-coding sequence. As you learned in Module 5, the vast majority of GWAS hits fall in enhancers, introns, and intergenic regions — not in protein-coding exons. They don't change protein sequence; they alter gene expression levels, timing, or cell-type specificity. Identifying the affected gene requires eQTL data and functional genomics — not just looking at which gene is nearest.

The nearest gene is often not the affected gene. Enhancers can regulate genes hundreds of kilobases away, looping across the genome. A GWAS hit "near" gene A may actually regulate gene B on the other side of the chromosomal neighborhood.

Effect sizes are tiny. The largest common-variant GWAS hits for complex diseases have odds ratios of 1.1–1.5. An odds ratio of 1.1 means carrying the risk allele increases your odds of disease by 10%. For context: smoking increases lung cancer odds by ~2,000%. Individual GWAS hits are not clinically actionable on their own.
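To see what an odds ratio of 1.1 means in absolute terms, it helps to convert odds back to risk. The sketch below does that conversion; the 10% baseline risk is an assumed number chosen for illustration, not a figure for any particular disease.

```python
def risk_with_odds_ratio(baseline_risk, odds_ratio):
    """Absolute risk after multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline: 10% lifetime risk without the risk allele.
baseline = 0.10
carrier = risk_with_odds_ratio(baseline, 1.1)

print(f"baseline risk: {baseline:.1%}")
print(f"risk with OR 1.1: {carrier:.1%}")
```

Under this assumption the risk moves from 10% to a little under 11%, which is why a single hit of this size says very little about any one person.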

What GWAS is genuinely good for:

Identifying biological pathways involved in a disease — if 50 hits in a schizophrenia GWAS cluster in genes involved in glutamate signaling, that implicates glutamate biology in the disease
Prioritizing drug targets: a GWAS hit in a gene encoding a drug target is genetic evidence that modulating the target affects the disease (see Mendelian randomization below)
Building polygenic scores that aggregate thousands of small effects into a predictive tool

---

Missing heritability

Early GWAS of complex traits were expected to find variants explaining most of the heritability. They didn't. Even after thousands of GWAS studies and millions of participants, identified common variants explain only a fraction of the estimated heritability for most traits.

For height: the heritability is ~80%. The largest height GWAS (5.4 million people, 2022) identified 12,111 significant variants explaining about 40% of height variance in European-ancestry participants. Since 40% is half of 80%, roughly half the heritable variation is still unexplained.

This is called the missing heritability problem. Where is the rest?

Current evidence suggests several explanations:

Rare variants: GWAS arrays capture common variants (minor allele frequency > 1–5%). Rare variants with larger effect sizes are missed. Whole-genome sequencing in large cohorts is beginning to find some of these.

Gene-gene interactions (epistasis): The effect of one variant may depend on the genotype at another locus. Standard GWAS tests each variant independently and misses interactions.

Gene-environment interactions: A variant's effect may depend on environment (diet, stress, exposure). Standard GWAS doesn't model this.

Structural variants: Large deletions, duplications, and inversions are not well-captured by SNP arrays and contribute to heritability in ways not fully counted.

Imperfect heritability estimates: Twin-based heritability estimates may overestimate true genetic heritability in some cases.

The missing heritability debate has been productive: it drove the field toward whole-genome sequencing, biobank-scale studies (UK Biobank: 500,000 participants), and more sophisticated statistical models.

---

Polygenic scores — aggregating small effects

If no single GWAS variant is clinically informative, can you aggregate thousands of them into a score that is? This is the logic behind polygenic scores (PGS), also called polygenic risk scores (PRS).

A polygenic score is calculated by:

Taking GWAS summary statistics (effect sizes for each associated SNP)
For an individual, multiplying each SNP's effect size by their genotype (0, 1, or 2 copies of the risk allele)
Summing across all SNPs to produce a single number — the individual's polygenic score
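The three steps above amount to a weighted sum. Here is a minimal Python sketch; the SNP identifiers, effect sizes, and genotypes are all invented for illustration, and `polygenic_score` is a hypothetical helper. Real scoring pipelines typically also account for LD between SNPs and for allele matching, which this sketch ignores.

```python
def polygenic_score(effect_sizes, risk_allele_counts):
    """Sum of (effect size x risk-allele count) over SNPs present in both inputs."""
    return sum(
        beta * risk_allele_counts[snp]
        for snp, beta in effect_sizes.items()
        if snp in risk_allele_counts
    )


# Hypothetical GWAS summary statistics: per-allele effect for each SNP.
effect_sizes = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.05}

# Hypothetical individual: copies of the risk allele carried at each SNP.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(effect_sizes, person)
print(round(score, 4))
```

A raw score like this only becomes interpretable when compared with the distribution of scores in a reference population, for example as a percentile.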

People with high polygenic scores for coronary artery disease have 3–5x higher lifetime risk than people with low scores, even when no single variant in their genome is individually alarming. The score is genuinely predictive at the population level.

What PGS can do:

Stratify populations by disease risk for preventive intervention
Identify people at high risk of early-onset disease who wouldn't be flagged by traditional risk factors
Potentially guide drug selection (high PGS for LDL cholesterol → benefit from statins)

What PGS cannot do:

Predict disease for an individual with meaningful precision (population-level statistics don't translate cleanly to individual risk)
Replace clinical risk factors, family history, or environmental assessment
Work equally well across ancestries

That last point connects directly back to Module 4. Polygenic scores are built from GWAS summary statistics — and GWAS has been conducted overwhelmingly in European-ancestry populations. A score trained on European GWAS data performs significantly worse in African, South Asian, and admixed populations, for two reasons:

LD patterns differ across populations: The tag SNPs in the score are in LD with causal variants in Europeans. In a different ancestry, the LD structure is different, so the same tag SNPs don't capture the same causal variants as well.

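Allele frequencies and effect sizes differ: Variants that are common in Europeans may be rare or absent in other populations, so they contribute little to the score there. Effect sizes estimated in European cohorts may also be inaccurate in other populations if the genetic background differs.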

A 2019 analysis (Duncan et al., Nature Communications) found polygenic score accuracy for schizophrenia, type 2 diabetes, and other traits dropped by 50–70% when applied to non-European populations. This is the polygenic score failure you first encountered in Module 4; now you have the mechanistic explanation.

---

Mendelian randomization — GWAS as a causal inference tool

One of the most powerful applications of GWAS is a technique called Mendelian randomization (MR): using genetic variants as natural experiments to test causal hypotheses.

The problem in epidemiology is confounding: people who drink coffee are different from people who don't in dozens of ways, making it hard to isolate whether coffee itself causes any health outcome. But genetic variants that influence coffee consumption (several have been found by GWAS) are assigned randomly at conception — they're not correlated with lifestyle, income, or other confounders in the same way.

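If people with higher genetic scores for coffee consumption have higher rates of outcome X, that's evidence that coffee causally influences X, because the genetic instrument is effectively randomized. The inference holds only if the variants affect X through coffee consumption and not through some other pathway.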

MR is also used to validate drug targets. If a GWAS variant that naturally lowers LDL cholesterol is associated with lower cardiovascular risk, that validates LDL as a causal risk factor — and validates LDL-lowering drugs. Conversely, if a GWAS variant lowers a biomarker but doesn't affect disease risk, the biomarker may not be causally relevant, and drugs targeting it may not work.

This is increasingly used in drug development to prioritize targets with genetic evidence, which has been shown to roughly double the probability of clinical trial success.
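The LDL example can be made concrete with the Wald ratio, one common single-variant MR estimator (the module does not name a specific estimator, so this choice is an assumption). It divides the variant's effect on the outcome by its effect on the exposure. The effect sizes below are invented for illustration.

```python
import math


def wald_ratio(beta_variant_outcome, beta_variant_exposure):
    """Causal effect of exposure on outcome implied by one genetic variant.

    Valid only if the variant influences the outcome solely through the
    exposure and is not associated with confounders.
    """
    return beta_variant_outcome / beta_variant_exposure


# Hypothetical variant: each copy lowers LDL by 0.20 mmol/L and
# lowers the log-odds of coronary artery disease by 0.10.
log_odds_per_mmol = wald_ratio(-0.10, -0.20)

# Convert to an odds ratio per 1 mmol/L higher LDL.
odds_ratio_per_mmol = math.exp(log_odds_per_mmol)

print(round(log_odds_per_mmol, 2), round(odds_ratio_per_mmol, 2))
```

With these assumed inputs, each 1 mmol/L of genetically higher LDL corresponds to roughly 1.65 times the odds of disease. In practice, MR studies combine many variants and test for violations of the assumptions noted in the docstring.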

---

Check yourself

1. A GWAS for type 2 diabetes identifies a SNP at genome-wide significance (p = 3 × 10⁻¹²) in an intergenic region 200 kb from the nearest annotated gene. A journalist writes: "Scientists discover the type 2 diabetes gene." List three specific things wrong with this headline.

2. A researcher runs a GWAS with 5,000 cases and 5,000 controls and finds no genome-wide significant hits. A second researcher runs a GWAS of the same trait with 500,000 cases and 500,000 controls and finds 400 significant loci. The second study was a meta-analysis that included the first study's samples. What does this tell you about the genetic architecture of the trait? What statistical concept explains the difference in results?

3. A polygenic score for coronary artery disease trained in a European cohort is applied to a Nigerian cohort. The score performs poorly — its predictions are near random. A colleague says this means CAD genetics differ fundamentally between Europeans and Nigerians. You disagree. What is the actual explanation, and what would you do to fix the score?

4. A GWAS finds a variant in the gene encoding a protein called X that is associated with reduced blood pressure. The same variant reduces expression of X by 30% in aortic tissue (from GTEx eQTL data). A pharmaceutical company wants to develop an inhibitor of protein X to treat hypertension. What additional evidence would you want before concluding this is a good drug target?

---

Key facts to remember

Complex traits are polygenic: thousands of variants, each tiny effect, collectively explaining large heritability
GWAS tests millions of SNPs for association with a trait; genome-wide significance threshold is p < 5 × 10⁻⁸
Most GWAS hits are in non-coding regulatory regions, not protein-coding sequence (connects to M5)
Tag SNPs are in LD with causal variants — the hit is rarely the causal variant itself
Manhattan plot: x = genomic position, y = -log₁₀(p-value); peaks rising above the p = 5 × 10⁻⁸ line (y ≈ 7.3) are genome-wide significant
Missing heritability: even large GWAS explain only a fraction of estimated heritability; rare variants, epistasis, GxE interaction are candidate explanations
Polygenic scores aggregate thousands of GWAS effect sizes; predictive at population level but not individual level
PGS performance drops 50–70% in non-European populations due to different LD patterns and allele frequencies (connects to M4)
Mendelian randomization uses genetic variants as natural instruments to test causal hypotheses and validate drug targets

---

Primary sources & references

Visscher, P. M. et al. (2017). "10 years of GWAS discovery." American Journal of Human Genetics, 101, 5–22.
Yengo, L. et al. (2022). "A saturated map of common genetic variants associated with human height." Nature, 610, 704–712.
Duncan, L. et al. (2019). "Analysis of polygenic risk score usage and performance in diverse human populations." Nature Communications, 10, 3328.
Boyle, E. A., Li, Y. I. & Pritchard, J. K. (2017). "An expanded view of complex traits: from polygenic to omnigenic." Cell, 169, 1177–1186.
Davey Smith, G. & Hemani, G. (2014). "Mendelian randomization: genetic anchors for causal inference in epidemiological studies." Human Molecular Genetics, 23, R89–R98.
Nelson, M. R. et al. (2015). "The support of human genetic evidence for approved drug indications." Nature Genetics, 47, 856–860.