Genomics | Module 3 of 13 | 30 min | Prerequisites: Modules 1–2 (you know what a genome is and what variants are)

How we read genomes — sequencing without the mystery

Start here

In Module 2 you learned that any two humans differ at roughly 4–5 million positions. But how do we actually find those positions? How do you go from a blood sample to a list of variants?

The answer is DNA sequencing, the technology that turned genomics from a 13-year, $3 billion project into a routine clinical test. Understanding how sequencing works isn't just trivia. It explains why a FASTQ file looks the way it does, why some variants get missed, why sequencing errors happen, and why the choice of sequencing platform changes what questions you can ask.

By the end of this module you should be able to answer:

How does Sanger sequencing work — and why is it still used?
What is next-generation sequencing and how did it change everything?
What is a FASTQ file and what is a quality score?
What is the difference between short-read and long-read sequencing?
What are the tradeoffs between whole genome, whole exome, and targeted sequencing?

---

Before sequencing: what you're actually working with

When a genomics lab sequences a person's DNA, they start with a biological sample — usually blood, saliva, or a tissue biopsy. From that sample, they extract DNA.

But here's the thing: you can't sequence a whole genome in one continuous read. The genome is 3.2 billion base pairs long, and no sequencing technology can read that from end to end in one pass. Every sequencing approach fragments the DNA into smaller pieces first, sequences those pieces, and then computationally reassembles them into a complete picture.

The size of those fragments, how they're sequenced, and how the assembly is done — that's what distinguishes different sequencing technologies from each other.

---

Sanger sequencing: the original method

Sanger sequencing was developed by Frederick Sanger in 1977 and earned him his second Nobel Prize in Chemistry in 1980. The same basic method was used by the Human Genome Project, which took 13 years, cost approximately $3 billion, and was declared complete in 2003 (a small fraction of the genome, mostly repetitive sequence, remained unresolved until 2022).

How it works:

You start with a single-stranded DNA template — the sequence you want to read. You add:

A DNA polymerase (the enzyme that builds new DNA strands)
A primer (a short sequence that tells the polymerase where to start)
All four normal deoxyribonucleotides (dNTPs): dATP, dCTP, dGTP, dTTP
A small amount of four special chain-terminating nucleotides called ddNTPs (dideoxyribonucleotides) — one for each base, each labeled with a different fluorescent dye

When the polymerase incorporates a ddNTP instead of a regular dNTP, it can't add the next nucleotide — the chain terminates. Because ddNTPs are present in small amounts, termination happens at random positions across many copies of the reaction. The result is a mixture of DNA fragments of every possible length, each ending at a different position, each with a fluorescent tag that identifies which base is at the end.

These fragments are separated by capillary electrophoresis: they migrate through a polymer-filled capillary at different speeds based on their length, smallest first. A laser detector reads the fluorescent colors as the fragments pass. The order of colors = the order of bases = the sequence.
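The chain-termination logic above can be sketched as a small simulation. This is a toy model under stated assumptions, not a description of any real instrument: the function names, the 5% ddNTP fraction, and the number of copies are all made up for illustration, and the simulation works directly on the newly synthesized strand rather than modeling the template and its complement.

```python
import random

def simulate_fragments(new_strand, n_copies=5000, ddntp_fraction=0.05, seed=0):
    """Extend many copies base by base; each extension step has a small
    chance of incorporating a ddNTP, which ends that copy at that base."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_copies):
        for position, base in enumerate(new_strand, start=1):
            if rng.random() < ddntp_fraction:
                # The fragment has length `position`; its dye reports `base`.
                fragments.append((position, base))
                break
    return fragments

def read_by_length(fragments):
    """Mimic electrophoresis: order fragments by length, shortest first,
    and read the dye color (terminal base) seen at each length."""
    dye_at_length = dict(fragments)
    return "".join(dye_at_length[length] for length in sorted(dye_at_length))

strand = "GATTACAGGCTA"
print(read_by_length(simulate_fragments(strand)))
```

With thousands of copies, every fragment length is sampled many times, so reading the terminal dye in order of length reproduces the strand. With too few copies some lengths would be missing, which is one way to see why the reaction needs many template molecules.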

Why Sanger sequencing is still used:

Sanger sequencing reads long, highly accurate sequences — up to about 900 base pairs in a single read, with accuracy around 99.99%. It's the gold standard for validating a specific variant found by other methods. When a clinical lab finds a concerning variant using next-generation sequencing, they often confirm it with Sanger sequencing before reporting the result.

Sanger is slow and expensive for whole-genome work — it would take years and millions of dollars to sequence a complete genome this way today. But for sequencing a single gene, a single exon, or validating a specific variant, it remains the most reliable method available.

---

The revolution: next-generation sequencing

Next-generation sequencing (NGS) — also called high-throughput sequencing or massively parallel sequencing — emerged in the mid-2000s and fundamentally changed what genomics could do.

The core insight: instead of sequencing one DNA fragment at a time (Sanger), sequence millions of fragments simultaneously.

How Illumina sequencing works (the dominant NGS platform):

Illumina's sequencing-by-synthesis approach is currently used in the majority of clinical and research genomic sequencing worldwide. Here's the process:

Step 1: Library preparation

The input DNA is fragmented into pieces, typically 300–500 base pairs long. Short adapter sequences are attached to both ends of every fragment. These adapters serve two purposes: they allow the fragments to bind to the sequencing flow cell, and they contain barcode sequences that identify which sample a fragment came from (allowing multiple samples to be sequenced together, a process called multiplexing).

Step 2: Cluster generation

The adapter-tagged fragments are loaded onto a flow cell, a glass slide coated with oligonucleotides complementary to the adapters. Each fragment binds to the surface and is amplified in place by a process called bridge amplification, creating a cluster of ~1,000 identical copies of that fragment. A flow cell typically contains hundreds of millions of clusters.

Step 3: Sequencing by synthesis

All clusters are sequenced simultaneously. Each cycle, a fluorescently labeled nucleotide is incorporated into the growing strand. The color of the fluorescent signal identifies which base was added. After imaging, the fluorescent label is chemically removed and the next cycle begins. This is repeated for the length of the read, typically 150 bases from each end of the fragment (paired-end sequencing).

Step 4: Data output

The result is hundreds of millions of short reads, each 150 base pairs long. This raw data is stored in FASTQ format (covered in the FASTQ section later in this module). A single whole human genome sequencing run produces roughly 100 gigabytes of raw FASTQ data.
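To make "150 bases from each end of the fragment" concrete, here is a minimal sketch of how a pair of reads relates to the fragment it came from. The function names are hypothetical, and the model ignores adapters, barcodes, and sequencing errors.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))

def paired_end_reads(fragment, read_length=150):
    """Read 1 starts at the 5' end of the forward strand.
    Read 2 starts at the 5' end of the reverse strand, so it covers
    the other end of the fragment, reported as a reverse complement."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2

fragment = "ACGT" * 100  # a made-up 400 bp fragment
read1, read2 = paired_end_reads(fragment)
unsequenced_middle = max(0, len(fragment) - len(read1) - len(read2))
print(len(read1), len(read2), unsequenced_middle)
```

For a 400 bp fragment the two 150 bp reads leave about 100 bp in the middle unread, but the aligner still learns that the two reads sit roughly 400 bp apart, which is useful evidence when placing them on the reference.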

What this changed:

In 2001, sequencing a human genome cost about $100 million. By about 2008, the first NGS instruments had brought the cost down to roughly $1 million. By 2014, the $1,000 genome milestone was reached. Today, whole genome sequencing costs approximately $200–600 depending on coverage depth and lab.

This cost reduction, several hundred-thousand-fold in roughly 20 years, is far faster than Moore's Law. It's one of the most dramatic technology cost curves in history. It transformed genomics from something only large research consortia could do into something clinical labs run routinely.

---

Coverage depth: how many times do you sequence each position?

When you sequence a genome, you don't read each position exactly once. The sequencing fragments are randomly distributed — some positions of the genome will be covered by more reads than others, just by chance.

Coverage depth (or simply "depth" or "coverage") describes how many times, on average, each position in the genome is sequenced. A 30x coverage genome has been sequenced to a depth where the average position is covered by 30 independent reads.
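The arithmetic behind a coverage figure is simple: mean coverage is (number of reads × read length) / genome size. The sketch below works it through for this module's numbers. The function names are illustrative, and the Poisson model for per-position depth is an idealization (real coverage is more uneven because of GC content, repeats, and library effects).

```python
import math

GENOME_SIZE = 3_200_000_000  # base pairs, the figure used in this module

def mean_coverage(n_reads, read_length, genome_size=GENOME_SIZE):
    """Average number of reads covering each position."""
    return n_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size=GENOME_SIZE):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

def prob_depth_below(k, mean_depth):
    """P(depth < k) at a position, assuming Poisson-distributed depth."""
    return sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )

print(reads_needed(30, 150))        # reads of 150 bp needed for 30x
print(prob_depth_below(10, 30))     # chance a position has fewer than 10 reads
```

Under these assumptions, 30x coverage with 150 bp reads takes 640 million reads, and only a tiny fraction of positions would fall below 10x by chance alone.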

Why does depth matter?

Sensitivity to heterozygous variants: If you're looking for a heterozygous SNP — present on one of two chromosomes — roughly half of the reads at that position will show the reference allele and half will show the variant allele. At 1x coverage, you'd have one read at that position — you'd either see the variant or not, essentially by chance. At 30x, you'd expect about 15 reads showing the variant and 15 showing the reference — a clear signal.

Somatic variant detection: A tumor sample is a mixture of cancer cells and normal cells, and not every cancer cell carries every mutation. A heterozygous somatic variant present in 10% of the cells in the sample appears in only about 5% of reads. Detecting that reliably requires very high depth: 100x, 500x, or even higher for liquid biopsy applications where tumor DNA is diluted in blood.
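Both points above come down to the same binomial calculation: if a fraction f of reads at a position carry the variant, how likely is it that at least k of n reads show it? The sketch below assumes independent reads and a hypothetical rule that a variant needs at least 3 supporting reads to be called; real callers use richer statistical models.

```python
from math import comb

def prob_at_least(k, depth, allele_fraction):
    """P(at least k of `depth` reads carry the variant), binomial model."""
    return sum(
        comb(depth, i)
        * allele_fraction**i
        * (1 - allele_fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )

# Heterozygous germline variant (about half of reads carry it).
print(prob_at_least(1, 1, 0.5))     # 1x: a coin flip
print(prob_at_least(3, 30, 0.5))    # 30x: essentially certain

# Somatic variant seen in about 5% of reads.
print(prob_at_least(3, 30, 0.05))   # 30x: usually missed
print(prob_at_least(3, 500, 0.05))  # 500x: essentially certain
```

At 30x a 5% variant reaches three supporting reads less than a fifth of the time, which is why tumor and liquid biopsy assays are run at hundreds or thousands of reads per position.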

Standard coverage levels in practice:

Whole genome sequencing (research): 30x
Whole genome sequencing (clinical): 30–40x
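Whole exome sequencing: 100x (higher because the target is small, so extra depth is affordable, and because capture efficiency varies between exons, so a high average is needed to keep the poorly captured ones callable)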
Targeted gene panel: 500–1000x
Liquid biopsy (detecting circulating tumor DNA): 1000x or higher

---

FASTQ files: the raw data format

Every NGS sequencing run produces FASTQ files. Understanding what a FASTQ file contains is foundational for anyone doing bioinformatics work.

A FASTQ file stores sequence reads along with their quality scores. Each read takes up exactly four lines:

```
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IIB<
```

Line 1: The read identifier — starts with @, followed by a unique name for this read. Contains information about the sequencing run, flow cell, lane, tile, and coordinates.

Line 2: The actual DNA sequence. Each character is one base: A, T, C, G, or N (N means the sequencer couldn't determine the base).

Line 3: A + sign (separator). Sometimes the identifier is repeated here; usually just +.

Line 4: The quality scores — one character per base, same length as line 2.
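The four-line layout described above can be read with a few lines of code. This is a minimal sketch that assumes the common unwrapped form of FASTQ (exactly four lines per record); the `FastqRecord` type and `read_fastq` function are names invented here, and for real work an established parser is a safer choice.

```python
import io
from typing import Iterator, NamedTuple, TextIO

class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one character per base

def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines, checking the basic invariants."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\r\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)

example = io.StringIO("@read1 demo\nGATTACA\n+\nIIIIII<\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

The length check matters in practice: a truncated file often shows up first as a record whose quality line is shorter than its sequence line.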

What are quality scores?

Each base call has an associated quality score that reflects the confidence of the sequencer in that base call. Quality scores are encoded as Phred scores (also called Q scores), named after the Phred base-calling software developed in the 1990s.

The Phred score Q is defined as:

Q = -10 × log₁₀(P)

where P is the estimated probability that the base call is wrong.

Phred Score	Error Probability	Accuracy
Q10	1 in 10	90%
Q20	1 in 100	99%
Q30	1 in 1,000	99.9%
Q40	1 in 10,000	99.99%
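The table follows directly from the definition, as this short sketch shows. The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an estimated error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")

# Expected number of wrong bases in a 150 bp read if every base were Q30.
print(150 * phred_to_error_prob(30))
```

Because the scale is logarithmic, each step of 10 means ten times fewer errors. Even at Q30 throughout, a 150 bp read is expected to contain about 0.15 wrong bases, so roughly one read in seven carries an error somewhere.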

In practice, Illumina reads typically achieve Q30 or better for most bases. Quality tends to drop toward the end of reads — the chemistry degrades over many cycles.

In the FASTQ file, quality scores are encoded as ASCII characters (to keep the file as text rather than binary). In the standard encoding (Phred+33), the score is the character's ASCII code minus 33. The character I in the example above (ASCII 73) corresponds to a Phred score of 40 (very high quality). The character < (ASCII 60) corresponds to Q27.
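Decoding a quality string is a one-liner once the encoding is known. The sketch below assumes the Phred+33 encoding used by current Illumina output and by the example read above; some older files use a different offset, so treat the default as an assumption to verify for your data.

```python
def decode_quality(quality_string, offset=33):
    """Convert each quality character to its Phred score."""
    return [ord(char) - offset for char in quality_string]

print(decode_quality("I9G6B<"))  # [40, 24, 38, 21, 33, 27]
```

The characters in this usage example are all taken from the quality line of the example read, so the output shows the range of scores that read contains.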

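Before doing any analysis with sequencing data, a quality control step checks read quality, filters out low-quality reads, and trims low-quality bases from read ends. FastQC is the standard first step in any NGS analysis pipeline; it reports quality metrics, while the filtering and trimming themselves are done by a separate tool such as fastp, Trimmomatic, or Cutadapt.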
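As an illustration of what "trimming low-quality bases from read ends" means, here is a deliberately simple sketch that removes bases from the 3' end until it reaches one at or above a threshold. The function name, the Q20 threshold, and the Phred+33 offset are assumptions for this example; production trimmers typically use sliding windows or running-sum methods rather than this base-by-base rule.

```python
def trim_low_quality_tail(sequence, quality, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]

# The last two bases have scores 10 ("+") and 2 ("#"), so they are removed.
print(trim_low_quality_tail("GATTACAGG", "IIIIIII+#"))
```

Trimming only from the end reflects the pattern noted above: on Illumina reads, quality tends to fall in the later cycles.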

---

From reads to variants: the bioinformatics pipeline

Raw FASTQ reads are not immediately useful. They need to be processed through a bioinformatics pipeline to identify variants. Here's the standard pipeline in brief — you'll understand each step better after the Build modules, but the conceptual overview matters now.

Step 1: Quality control

FastQC assesses read quality. A trimming tool (for example fastp or Trimmomatic) then discards reads below a quality threshold and trims low-quality bases at read ends. This is non-negotiable: garbage in, garbage out.

Step 2: Alignment

Each read is aligned to the reference genome (GRCh38) using an aligner such as BWA-MEM or Bowtie 2. The aligner finds the position in the 3.2-billion-base-pair reference where each read best matches. Output: a SAM file (Sequence Alignment Map), usually converted to the compressed binary BAM format.

This step is computationally intensive. A 30x whole genome is roughly 600 million 150 bp reads (about 300 million read pairs), and aligning them to the reference takes hours on a standard server.
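The core idea of alignment, finding where a read best matches the reference while tolerating a few differences, can be sketched with a toy "seed and extend" approach. This is not how BWA-MEM works internally (real aligners use compressed indexes and allow insertions and deletions); the function names, the seed length of 4, and the tiny reference are all made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) for the best ungapped placement, or None."""
    # Seed: exact k-mer hits suggest candidate start positions.
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            candidates.add(hit - offset)
    # Extend: score each candidate by counting mismatches over the full read.
    best = None
    for start in sorted(candidates):
        if start < 0 or start + len(read) > len(reference):
            continue
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best

reference = "TTGACCGATTACAGGCTAACGTTAGC"
index = build_kmer_index(reference, k=4)
read = "GATTACTGGCTA"  # differs from the reference at one base
print(align_read(read, reference, index, k=4))  # (6, 1)
```

The single mismatch that survives alignment is exactly the kind of signal the variant caller later examines: it is either a sequencing error or a real difference from the reference.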

Step 3: Duplicate marking

Library preparation amplifies DNA fragments, so the same original fragment can generate multiple identical reads. These PCR duplicates are identified and marked (not counted for variant calling) because they don't represent independent evidence.
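In simplified form, duplicate marking groups reads that align to the same place on the same strand and keeps one representative. The sketch below is an assumption-laden toy: the `AlignedRead` type is invented here, and real tools also take the mate's position and clipping into account when deciding what counts as a duplicate.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class AlignedRead:
    name: str
    chrom: str
    start: int
    strand: str
    mean_quality: float
    is_duplicate: bool = False

def mark_duplicates(reads):
    """Flag all but the highest-quality read at each (chrom, start, strand)."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r.mean_quality)
        for read in group:
            read.is_duplicate = read is not best
    return reads

reads = [
    AlignedRead("r1", "chr1", 1000, "+", 35.0),
    AlignedRead("r2", "chr1", 1000, "+", 38.5),  # same fragment as r1
    AlignedRead("r3", "chr1", 1000, "-", 36.0),  # other strand: kept
    AlignedRead("r4", "chr1", 1042, "+", 30.0),
]
for read in mark_duplicates(reads):
    print(read.name, "duplicate" if read.is_duplicate else "kept")
```

Reads are flagged rather than deleted, so downstream tools can choose to ignore them while the original data stays intact.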

Step 4: Variant calling

Software scans the aligned reads and identifies positions where the reads differ from the reference. GATK (Genome Analysis Toolkit) from the Broad Institute is the standard toolkit: its HaplotypeCaller is widely used for germline variant calling, and its Mutect2 tool for somatic variants in cancer.

Output: a VCF file (Variant Call Format) — a structured text file listing every variant found, its position, the reference and alternate alleles, and quality metrics.
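To show the kind of evidence a variant caller weighs, here is a threshold-based sketch that looks at the bases observed at one position and reports a call using VCF-like field names. It is not GATK's method (production callers use local reassembly and probabilistic genotype models), and the thresholds and the `call_variant` function are hypothetical choices for illustration.

```python
from collections import Counter

def call_variant(chrom, pos, ref_base, pileup_bases,
                 min_depth=10, min_alt_fraction=0.2):
    """Return a VCF-like record for the most common non-reference base,
    or None if the evidence is too thin."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return None
    alt_counts = Counter(base for base in pileup_bases if base != ref_base)
    if not alt_counts:
        return None
    alt_base, alt_count = alt_counts.most_common(1)[0]
    alt_fraction = alt_count / depth
    if alt_fraction < min_alt_fraction:
        return None  # more likely sequencing error than a germline variant
    genotype = "1/1" if alt_fraction > 0.8 else "0/1"
    return {"CHROM": chrom, "POS": pos, "REF": ref_base, "ALT": alt_base,
            "DP": depth, "AF": round(alt_fraction, 3), "GT": genotype}

# 30 reads at one position: 16 match the reference, 14 show G.
print(call_variant("chr1", 1000, "A", "A" * 16 + "G" * 14))
# One stray T among 30 reads is treated as noise.
print(call_variant("chr1", 2000, "A", "A" * 29 + "T"))
```

A fixed 20% cutoff suits germline calls, where true variants appear in about half or all of the reads. It would discard the low-fraction somatic variants discussed in the coverage section, which is one reason somatic calling uses different tools.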

Step 5: Variant annotation

Raw variants have no biological interpretation. Annotation tools (VEP, ANNOVAR, SnpEff) add information: which gene is affected, what effect the variant has on the protein, population frequency in gnomAD, clinical classifications from ClinVar.

This annotated VCF is what clinical and research genomicists actually interpret.

---

Short-read vs. long-read sequencing

Illumina's NGS produces short reads, typically 150 base pairs per read. This has served genomics well for more than a decade, but short reads have real limitations.

The problem with short reads:

The human genome contains large repetitive regions — Alu elements, satellite DNA, and other sequences that appear thousands of times across the genome. When you align a 150bp read that matches a repetitive sequence, you can't tell which copy of that repeat it came from. These regions are effectively invisible to short-read sequencing.
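The ambiguity is easy to demonstrate on a toy reference containing two copies of the same element. The sequences below are invented for the example, and the 10-base "reads" stand in for 150 bp reads against repeats that are hundreds or thousands of bases long.

```python
def find_all(reference, read):
    """Return every position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

repeat = "GGCCGGGCGC"
reference = "ACGTTGCA" + repeat + "TTAGACCA" + repeat + "CATGGTAC"

# A read that lies entirely inside the repeat fits in two places.
print(find_all(reference, repeat))        # [8, 26]

# A read that reaches into unique flanking sequence fits in one place.
print(find_all(reference, "TGCAGGCCGG"))  # [4]
```

The second case is the whole argument for longer reads: a read long enough to include unique sequence on either side of a repeat can be placed unambiguously.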

Short reads also struggle with:

Large structural variants (inversions, translocations) that span distances longer than a few hundred base pairs
Phasing — determining which variants are on the same chromosome (haplotype) vs. different chromosomes
The centromeric and telomeric regions of chromosomes, which are almost entirely repetitive

Long-read sequencing:

Two platforms have made long-read sequencing practical:

Oxford Nanopore Technologies (ONT) passes DNA through a protein nanopore. As each base translocates through the pore, it disrupts an ionic current in a characteristic way. The current signal is decoded into a base sequence. ONT can read tens of thousands of base pairs in a single continuous read, and ultra-long protocols have produced individual reads longer than a million base pairs. The MinION sequencer fits in a pocket and costs about $1,000. Disadvantage: higher raw error rate than Illumina (~5–10% per base vs. ~0.1% for Illumina), though accuracy has improved significantly with recent chemistry.

Pacific Biosciences (PacBio) uses a different approach: single-molecule real-time (SMRT) sequencing. DNA polymerase synthesizes a new strand in real time inside a tiny well called a zero-mode waveguide. Fluorescent signals from incorporated nucleotides are detected as the polymerase works. PacBio produces reads of 10,000–25,000 base pairs with high accuracy (especially with HiFi reads, which sequence the same molecule multiple times for error correction). Disadvantage: lower throughput and higher cost per base than Illumina.
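The idea behind sequencing the same molecule multiple times can be illustrated with a per-position majority vote. This is only a sketch of the principle: it assumes the passes are already aligned, equal in length, and differ only by substitutions, whereas real consensus calling must also handle insertions and deletions and uses a statistical model rather than a simple vote.

```python
from collections import Counter

def consensus(passes):
    """Majority vote at each position across repeated passes of one molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("This toy model needs equal-length passes")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )

# Five noisy passes over the same made-up molecule; each error is isolated.
passes = ["GATTACA", "GATTGCA", "GCTTACA", "GATTACA", "GATTACT"]
print(consensus(passes))  # GATTACA
```

Because errors in different passes tend to fall at different positions, each one is outvoted, and the consensus ends up more accurate than any single pass.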

Why long reads matter:

The first truly complete human genome — with no gaps — was published in 2022 by the Telomere-to-Telomere (T2T) Consortium. It added 200 million base pairs of previously unsequenced sequence to the human reference, almost entirely from centromeric and other repetitive regions. This was only possible with long-read sequencing. Short reads couldn't assemble those regions.

Long reads are increasingly used for:

Detecting structural variants that short reads miss
Characterizing repeat expansion disorders (like Huntington's disease or fragile X)
Phasing variants to determine haplotype structure
Sequencing genomes where the reference is poorly assembled
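Repeat expansions show why read length matters. The sketch below counts the longest uninterrupted run of a repeat unit within a single read. The function name and the example read are made up, and real repeat genotyping tools also handle sequencing errors and interrupted repeats.

```python
import re

def longest_repeat_run(sequence, unit="CAG"):
    """Return the largest number of consecutive copies of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)

# A made-up read: flanking sequence, 42 CAG repeats, flanking sequence.
read = "TTC" + "CAG" * 42 + "CCGCCA"
print(longest_repeat_run(read))  # 42
print(len("CAG" * 42))           # 126 bases of pure repeat
```

A 42-copy run is 126 bases, so a 150 bp read can barely span it with a few bases of flanking sequence on each side, and longer expansions cannot be spanned at all. A read of several thousand bases covers the whole repeat plus unique sequence on both sides, so the copies can simply be counted.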

---

Whole genome vs. whole exome vs. targeted panel

When ordering sequencing for clinical or research purposes, the choice of sequencing approach determines what you can and can't find.

Whole genome sequencing (WGS)

Sequences all 3.2 billion base pairs. Finds variants in coding regions, regulatory elements, introns, and intergenic regions. Detects CNVs, structural variants, and repeat expansions alongside SNPs and indels.

Advantages: comprehensive; doesn't require prior knowledge of which genes to look at; covers regulatory and intronic variants; structural variant detection included.

Disadvantages: higher cost; produces massive amounts of data; most variants found in non-coding regions have uncertain significance; deeper analysis required.

Whole exome sequencing (WES)

Sequences only the exome, the ~1–2% of the genome that encodes protein. Achieves ~100x coverage on coding regions. Misses most regulatory, deep intronic, and structural variants.

Advantages: cheaper than WGS; higher depth on coding regions; most clearly interpretable disease-causing variants are in coding sequence.

Disadvantages: misses deep intronic splice variants, promoter mutations, and structural rearrangements that don't affect coding sequence; CNV detection is less reliable than with WGS.

Targeted gene panels

Sequences only specific genes known to be relevant to a clinical question. A hereditary cancer panel might include 30–80 genes including BRCA1, BRCA2, MLH1, MSH2, TP53, and others. Achieves very high depth (500–1000x) on those specific genes.

Advantages: cheapest; fastest turnaround; highest depth and sensitivity for variants in targeted genes; easiest to interpret results.

Disadvantages: misses everything outside the panel; doesn't help when the diagnosis is unknown; panel design determines what can be found.

Which to choose:

In clinical practice: targeted panels for specific clinical questions (hereditary cancer risk, pharmacogenomics); WES for undiagnosed rare diseases where the causative gene is unknown; WGS increasingly for cases where WES is unrevealing or where structural variants are suspected.

In research: WGS is the gold standard. WES remains common for large cohort studies where cost is limiting.

---

The $200 genome and what it actually costs

"The $200 genome" refers to the cost of sequencing reagents on high-throughput Illumina instruments at large genomics centers. It is real but incomplete.

The $200 figure covers: sequencing reagents, the flow cell, and instrument depreciation.

It does not cover: DNA extraction, library preparation, sequencing informatics (compute, storage, software licenses), variant interpretation, clinical reporting, or labor.

The total cost of a clinically reported whole genome sequence at a hospital lab is typically $1,000–$3,000. The total cost of a clinical exome with interpretation and report is $1,000–$2,500. Targeted panels run $300–$1,500 depending on the genes and depth.

This matters for understanding access to genomic medicine. Even at $200 in reagents, the infrastructure required — high-performance computing clusters for alignment and variant calling, petabyte-scale storage for raw data, trained genomicists for interpretation — means clinical genomics is concentrated in well-resourced academic medical centers and specialty labs. The cost curve is still falling. But sequencing and interpretation are not the same thing, and the bottleneck has shifted from generating data to making sense of it.

---

Key Terms

Term	Definition
Sanger sequencing	Chain-termination sequencing method; reads up to ~900bp; gold standard for validation
ddNTP	Dideoxynucleotide; chain-terminating nucleotide used in Sanger sequencing
Next-generation sequencing (NGS)	Massively parallel sequencing; sequences millions of fragments simultaneously
Illumina	Dominant short-read NGS platform; sequencing-by-synthesis; 150bp reads
Library preparation	Process of fragmenting DNA and attaching adapters before sequencing
Coverage depth	Average number of times each genomic position is sequenced
FASTQ	Standard file format for storing raw sequencing reads + quality scores
Phred score (Q score)	Quality score: Q = -10 × log₁₀(error probability). Q30 = 99.9% accuracy
SAM/BAM	Sequence Alignment Map; file format storing reads aligned to a reference
VCF	Variant Call Format; file storing called variants with position and quality info
GATK	Genome Analysis Toolkit; standard variant calling software (Broad Institute)
Short-read sequencing	Reads ~150bp; Illumina; high accuracy, limited for repetitive regions
Long-read sequencing	Reads 10,000–100,000+bp; Oxford Nanopore or PacBio; better for repeats and structural variants
Whole genome sequencing	Sequences all 3.2 billion base pairs
Whole exome sequencing	Sequences only protein-coding regions (~1–2% of genome)
Targeted panel	Sequences specific genes at high depth
T2T Consortium	Telomere-to-Telomere; published first complete, gapless human genome in 2022

---

Check Your Understanding

Answer these without looking back — then revisit any you're unsure of.

Why does Sanger sequencing produce fragments of every possible length — and how does that tell you the sequence?
A variant is present in only 8% of reads at a given position in a tumor sample. Does this mean the variant call is an error? What's a more likely explanation?
You have a FASTQ read where the quality score for one base is the ASCII character corresponding to Q10. Should you trust that base call? Why or why not?
A researcher wants to find the splice site variant that's causing a patient's undiagnosed disease. The patient has already had a targeted gene panel and whole exome sequencing, both negative. What sequencing approach would you recommend next, and why?
The T2T Consortium needed long-read sequencing to complete the human genome. What specifically about centromeric regions makes short-read sequencing fail there?
A sequencing lab advertises "the $200 genome." A patient asks why their insurance is being billed $2,200. What's the honest explanation?

---

Where this takes you

You now know how a genome is sequenced and what the raw data looks like. But sequencing produces reads that need to be aligned to something — a reference. Module 4 asks: what exactly is that reference, who's in it, what's missing from it, and why does that matter for every analysis you'll ever do?

→ Module 4: Reference genomes — and who's missing from them

---

Primary sources:

Sanger, F. et al. (1977). "DNA sequencing with chain-terminating inhibitors." PNAS, 74, 5463–5467.
Bentley, D.R. et al. (2008). "Accurate whole human genome sequencing using reversible terminator chemistry." Nature, 456, 53–59.
Nurk, S. et al. (2022). "The complete sequence of a human genome." Science, 376, 44–53. (T2T paper)
Alkan, C. et al. (2011). "Genome structural variation discovery and genotyping." Nature Reviews Genetics, 12, 363–376.