Genomics · Module 5 of 13 · 40 min · Prerequisites: Modules 1–4 (you know what a genome is, what variants are, how sequencing works, and what a reference genome is)

Gene expression — how the genome decides what to do

Start here

You've learned what a genome is, how to read it, and how to interpret variation in it. But here's a question that hasn't come up yet: if every cell in your body contains the same genome, why is a neuron different from a liver cell?

The answer is gene expression — the process by which specific genes are turned on or off in specific cells at specific times. A liver cell and a neuron have identical DNA, but they express very different subsets of it. The genome is the instruction manual; gene expression is the decision about which instructions to follow.

Understanding gene expression is fundamental to genomics because most diseases don't just involve mutations in DNA; they involve changes in when and how much specific genes are expressed. Cancer is largely a disease of dysregulated gene expression. Most GWAS variants linked to complex disease aren't in protein-coding sequence; they're in regulatory regions that control expression. And the emerging field of RNA therapeutics targets expression directly.

By the end of this module you should be able to answer:

What is the central dogma and why does it matter?
How does a cell go from DNA sequence to functional protein?
What controls which genes are expressed in which cells?
What is epigenetics and how does it differ from genetics?
How do scientists measure gene expression at scale?

---

The central dogma — the flow of genetic information

In 1958, Francis Crick articulated what he called the "central dogma" of molecular biology: genetic information flows in one direction.

DNA → RNA → Protein

This is the core logic of how genomes work. DNA is the storage medium — stable, double-stranded, protected in the nucleus. RNA is the messenger — a working copy of a gene's instructions that can be transported out of the nucleus. Protein is the functional output — the molecule that actually does something in the cell.

There are two steps:

Transcription: DNA is read by an enzyme called RNA polymerase, which synthesizes a complementary RNA strand. This RNA copy of a gene is called messenger RNA (mRNA).

Translation: The mRNA is read by a ribosome, which uses the sequence to build a protein. Groups of three nucleotides (called codons) each specify one amino acid. The ribosome moves along the mRNA, bringing in the correct amino acid for each codon via transfer RNA (tRNA), and linking them into a growing protein chain.
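
The transcription step can be sketched as a simple base-pairing rule. This is a toy illustration, not a model of RNA polymerase: the function name `transcribe` and the example template strand are made up for this sketch, and the template is assumed to be written 3'→5' so that the mRNA comes out 5'→3'.

```python
# Toy model of transcription: each template base pairs with its RNA complement.
# Assumption: the template strand is written 3'->5', so the mRNA reads 5'->3'.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5: str) -> str:
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(COMPLEMENT[base] for base in template_3to5.upper())

template = "TACGGCATT"  # hypothetical template strand
mrna = transcribe(template)
print(mrna)  # AUGCCGUAA
```

Note that RNA uses uracil (U) where DNA uses thymine (T), which is why an A in the template pairs with U in the transcript.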

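The central dogma is often summarized as "DNA makes RNA makes protein, always, in one direction." That summary is mostly true, but it is stricter than Crick's own claim (that sequence information cannot flow from protein back into nucleic acid), and it has important exceptions: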

Retroviruses like HIV use an enzyme called reverse transcriptase to go from RNA back to DNA
Epigenetic inheritance can transmit information that isn't encoded in DNA sequence (more on this later)
Some RNAs are functional without ever becoming protein — these non-coding RNAs regulate other genes

Still, for the vast majority of human genes, DNA → RNA → Protein is the logic.

---

Transcription — how DNA becomes RNA

Transcription doesn't happen randomly across the genome. It requires specific machinery and signals to start and stop at the right places.

The promoter: Every gene has a promoter — a DNA sequence upstream of the gene that acts as the "start here" signal for RNA polymerase. Promoters contain specific sequence motifs (like the TATA box) that are recognized by general transcription factors, which then recruit RNA polymerase to the right position.
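
As a rough illustration of what "recognizing a sequence motif" means, the sketch below scans a DNA string for a TATA-box-like pattern. The pattern `TATA[AT]A[AT]` is a simplified consensus, the promoter sequence is invented, and real motif recognition is done by proteins with graded affinity rather than exact string matching.

```python
import re

# Simplified TATA-box consensus (TATAWAW, where W is A or T).
# Assumption: exact matching stands in for protein-DNA recognition.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")

def find_tata_boxes(promoter: str) -> list[int]:
    """Return 0-based start positions of TATA-box-like motifs."""
    return [m.start() for m in TATA_BOX.finditer(promoter.upper())]

promoter = "GGCGCCTATAAAAGGCTGCA"  # made-up promoter sequence
print(find_tata_boxes(promoter))  # [6]
```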

Transcription factors (TFs): These are proteins that bind to specific DNA sequences and either activate or repress transcription of nearby genes. The human genome encodes roughly 1,600 transcription factors. They are the primary mechanism by which cells control which genes are expressed. A liver cell expresses different transcription factors than a neuron, which causes different genes to be transcribed in each cell type.

Enhancers and silencers: Beyond promoters, transcription is also regulated by distant regulatory sequences called enhancers (which increase transcription) and silencers (which decrease it). These can be located thousands or even hundreds of thousands of base pairs away from the gene they regulate, and they act by physically looping the DNA so the enhancer comes into contact with the promoter.

This is critical for understanding GWAS results: the vast majority of disease-associated variants from genome-wide association studies fall in non-coding regions of the genome — often in enhancers. They don't change protein sequence; they change how much of a protein gets made, or in which cells, or at what time.

RNA processing: When RNA polymerase first synthesizes RNA from a gene, the product is a pre-mRNA that requires significant processing before it becomes functional:

5' cap: A modified nucleotide is added to the beginning of the mRNA, protecting it from degradation and helping ribosomes recognize it
Poly-A tail: A string of ~200 adenine nucleotides is added to the end, which also protects the mRNA from degradation and signals for export from the nucleus
Splicing: Human genes are interrupted by non-coding sequences called introns. The coding sequences are called exons. Splicing removes the introns and joins the exons together. This happens in a large RNA-protein complex called the spliceosome.

Alternative splicing is one of the most important sources of protein diversity in the human genome. By joining different combinations of exons, a single gene can produce multiple different mRNA isoforms — and therefore multiple different proteins. The human genome encodes ~20,000 protein-coding genes, but the proteome (the full set of proteins the body makes) is estimated to contain 80,000–400,000 distinct proteins. Alternative splicing is a major reason why.
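
To see how exon combinations multiply, here is a minimal sketch that enumerates isoforms for a hypothetical five-exon gene in which two "cassette" exons can each be included or skipped. The exon names and the `isoforms` helper are assumptions for illustration. Real splicing also involves other event types (alternative splice sites, retained introns) that this ignores.

```python
from itertools import product

def isoforms(exons, cassette):
    """Yield every exon combination when each cassette exon is kept or skipped."""
    optional = [e for e in exons if e in cassette]
    for choices in product([True, False], repeat=len(optional)):
        skipped = {e for e, keep in zip(optional, choices) if not keep}
        # Exon order is preserved; only the skipped cassette exons are dropped.
        yield [e for e in exons if e not in skipped]

exons = ["E1", "E2", "E3", "E4", "E5"]  # hypothetical gene
cassette = {"E2", "E4"}                  # exons that may be skipped

for iso in isoforms(exons, cassette):
    print("-".join(iso))
# E1-E2-E3-E4-E5
# E1-E2-E3-E5
# E1-E3-E4-E5
# E1-E3-E5
```

Each additional cassette exon doubles the number of possible isoforms, so n optional exons give up to 2^n mRNAs from one gene.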

---

Translation — how RNA becomes protein

Once a processed mRNA exits the nucleus, ribosomes translate it into protein.

The genetic code maps each three-nucleotide codon to a specific amino acid. There are 64 possible codons (4³) and only 20 amino acids, so the code is redundant — most amino acids are encoded by 2–6 different codons. Three codons are stop signals (UAA, UAG, UGA) that tell the ribosome to release the finished protein.
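
The redundancy can be checked directly. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, `*` for stop, codons ordered with bases U, C, A, G) and counts codons per amino acid. The variable names are this sketch's own.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_aa = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE))                        # 64
print(len(codons_per_aa) - 1)                  # 20 amino acids (excluding stop)
print(stop_codons)                             # ['UAA', 'UAG', 'UGA']
print(codons_per_aa["L"], codons_per_aa["M"])  # 6 1
```

The count shows the two exceptions to the 2–6 range: methionine (M) and tryptophan (W) each have exactly one codon.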

Translation proceeds in three phases:

Initiation: The ribosome assembles at the start codon (AUG, which codes for methionine)
Elongation: The ribosome moves along the mRNA codon by codon, with tRNA molecules delivering the appropriate amino acid at each step
Termination: The ribosome hits a stop codon and releases the completed polypeptide chain
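
The three phases map onto a short loop. This is a simplified sketch: it assumes translation starts at the first AUG in the string (real initiation depends on sequence context), and the `translate` function and the example mRNA are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate an mRNA into one-letter amino acid codes."""
    start = mrna.find("AUG")  # initiation (assumption: first AUG is the start)
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):  # elongation, one codon at a time
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # termination at the first in-frame stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGCCGUUUUAAGC"))  # MPF
```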

The resulting protein then typically folds and undergoes post-translational modifications (phosphorylation, glycosylation, cleavage) before it reaches its final functional form. These modifications add another layer of complexity beyond what the DNA sequence alone encodes.

---

Gene regulation — the same genome, different decisions

If every cell has the same genome, the diversity of cell types in the body is explained by differences in gene expression. A cardiomyocyte (heart muscle cell) expresses high levels of cardiac troponin. A retinal cell expresses opsins. A B cell expresses immunoglobulin genes. None of these cell types expresses the others' signature genes, even though they all carry them.

This is controlled by a hierarchy of regulatory mechanisms:

Chromatin accessibility: In the nucleus, DNA is packaged around proteins called histones into a compact structure called chromatin. Tightly packed chromatin (heterochromatin) physically blocks transcription factors and RNA polymerase from accessing DNA. Loosely packed chromatin (euchromatin) is accessible. A major part of cell identity is the pattern of which genomic regions are open (accessible) vs. closed (inaccessible). This is measured using a technique called ATAC-seq.

Histone modifications: Histones can be chemically modified — most importantly by methylation and acetylation — which changes how tightly DNA wraps around them and recruits or repels transcription factors. Histone modifications act as a "histone code" that marks regions of the genome as active, poised, or repressed.

DNA methylation: A methyl group can be added directly to cytosine nucleotides in DNA (primarily at CpG dinucleotides). Methylation in promoter regions generally silences gene expression. DNA methylation patterns are heritable — they're maintained through cell division — and are established early in development.
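
A CpG dinucleotide is simply a C immediately followed by a G on the same strand. The sketch below locates CpG sites in a made-up promoter sequence and computes the fraction that are methylated, given a hypothetical set of methylated positions (in practice this information comes from an assay such as bisulfite sequencing, not from the sequence itself).

```python
def cpg_sites(seq: str) -> list[int]:
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

promoter = "ATCGGCGCTACGTTCG"  # made-up promoter sequence
sites = cpg_sites(promoter)
print(sites)  # [2, 5, 10, 14]

# Hypothetical assay result: positions of methylated cytosines.
methylated = {2, 10}
fraction = len(methylated & set(sites)) / len(sites)
print(fraction)  # 0.5
```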

Together, histone modifications and DNA methylation constitute the epigenome — a layer of regulatory information on top of the DNA sequence itself.

---

Epigenetics — a second layer of heritable information

The word "epigenetic" means "above genetics." Epigenetic marks are chemical modifications to DNA or histones that regulate gene expression without changing the underlying nucleotide sequence.

What makes epigenetics remarkable:

Epigenetic marks are cell-type specific — a liver cell and a neuron have the same DNA but different methylation patterns
Epigenetic marks are mitotically heritable — when a cell divides, its epigenome is largely copied to daughter cells, maintaining cell identity
Some epigenetic marks may be transgenerationally heritable — potentially transmitted from parent to offspring, though this is debated in humans
Epigenetic marks can be environmentally influenced — diet, stress, toxin exposure, and other environmental factors can alter methylation patterns

Imprinting is one well-established case of epigenetic inheritance: about 100 human genes are imprinted, meaning only the copy inherited from the mother or only the copy from the father is expressed, depending on the gene. This imprinting is maintained by methylation patterns established in the germline.

Epigenetics has become central to cancer biology: cancer cells globally lose methylation (hypomethylation) while locally gaining it at tumor suppressor gene promoters — a systematic rewriting of the epigenome that silences genes that would otherwise stop uncontrolled growth.

---

RNA-seq — measuring expression at scale

For decades, scientists studied one gene's expression at a time. The development of RNA sequencing (RNA-seq) made it possible to measure the expression of every gene in a sample simultaneously.

RNA-seq works by:

Extracting all RNA from a sample (tissue, cell culture, single cells)
Converting mRNA to complementary DNA (cDNA) using reverse transcriptase
Sequencing the cDNA library using the same next-generation sequencing technology you learned in Module 3
Counting how many reads map to each gene — this count is a proxy for expression level
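
The final counting step can be sketched with toy data. Everything here is hypothetical: the gene coordinates, the read positions, and the `count_reads` helper. Real pipelines count aligned reads against a full gene annotation and must handle strands, multi-mapping reads, and reads that span exon junctions.

```python
from collections import Counter

# Hypothetical annotation: gene -> (start, end) on one toy chromosome.
genes = {"GENE_A": (100, 500), "GENE_B": (800, 1200)}

# Hypothetical alignment start positions, one per read.
read_starts = [120, 130, 450, 810, 900, 950, 1100, 2000]

def count_reads(read_starts, genes):
    """Count reads whose start position falls inside each gene."""
    counts = Counter({gene: 0 for gene in genes})
    for pos in read_starts:
        for gene, (start, end) in genes.items():
            if start <= pos < end:
                counts[gene] += 1
                break  # assign each read to at most one gene
    return counts

counts = count_reads(read_starts, genes)
print(dict(counts))  # {'GENE_A': 3, 'GENE_B': 4}
```

The read at position 2000 falls outside both genes and is left uncounted.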

The result is a transcriptome: the complete profile of which genes are expressed in a sample and at what level. Compare the transcriptomes of tumor vs. normal tissue and you find differentially expressed genes. Compare them across cell types and you get the regulatory logic of cell identity.
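
One common way to summarize a tumor vs. normal comparison is the log2 fold change. The sketch below assumes the values have already been normalized, uses invented numbers, and adds a pseudocount of 1 to avoid dividing by zero; it shows effect size only, whereas a real differential expression analysis also tests statistical significance across replicates.

```python
import math

def log2_fold_change(tumor: float, normal: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of tumor to normal expression, with a pseudocount."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))

# Hypothetical normalized expression values: gene -> (tumor, normal).
expression = {"GENE_A": (300.0, 290.0), "GENE_B": (63.0, 7.0)}

for gene, (tumor, normal) in expression.items():
    print(gene, round(log2_fold_change(tumor, normal), 2))
# GENE_A 0.05
# GENE_B 3.0
```

A log2 fold change of 3 means roughly an 8-fold increase, regardless of how large the absolute values are.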

Single-cell RNA-seq (scRNA-seq) takes this further: instead of averaging expression across thousands of cells, it measures expression in each cell individually. This has transformed our understanding of tissue heterogeneity, discovering new cell types and states within tissues previously thought to be homogeneous.
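
A tiny example shows what averaging hides. The per-cell values below are invented: half the cells express a gene strongly and half not at all, yet the bulk-style average suggests moderate expression everywhere.

```python
from statistics import mean

# Hypothetical expression of one gene in four individual cells.
cells = {"cell_1": 0, "cell_2": 0, "cell_3": 100, "cell_4": 100}

bulk_average = mean(list(cells.values()))
expressing = [name for name, value in cells.items() if value > 0]

print(bulk_average)  # 50
print(expressing)    # ['cell_3', 'cell_4']
```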

Key RNA-seq concepts:

Counts/reads per gene: Raw measure of how many sequencing reads mapped to each gene
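Normalization: Counts must be normalized before comparison: for library size when comparing samples, and also for gene length when comparing genes within a sample. TPM and FPKM adjust for both; DESeq2-normalized counts adjust for library size and composition but not gene length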
Differential expression (DE): Statistical test for which genes are expressed at significantly different levels between conditions
Pathway analysis: After DE testing, gene sets are mapped to biological pathways (GO terms, KEGG) to interpret results at the systems level
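
As a worked example of normalization, the sketch below computes TPM (transcripts per million) from raw counts and gene lengths. The gene names, counts, and lengths are made up. The calculation divides each count by gene length in kilobases, then rescales so the values sum to one million.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Convert raw read counts to transcripts per million (TPM)."""
    # Step 1: reads per kilobase, which corrects for gene length.
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: rescale so all values sum to 1,000,000 (corrects for library size).
    scale = sum(rate.values()) / 1_000_000
    return {gene: r / scale for gene, r in rate.items()}

# Hypothetical sample: raw counts and gene lengths in base pairs.
counts = {"GENE_A": 1000, "GENE_B": 1000, "GENE_C": 500}
lengths = {"GENE_A": 1000, "GENE_B": 4000, "GENE_C": 500}

for gene, value in tpm(counts, lengths).items():
    print(gene, round(value))
# GENE_A 444444
# GENE_B 111111
# GENE_C 444444
```

GENE_A and GENE_B have the same raw count, but GENE_B is four times longer, so its TPM is four times lower. GENE_C has half the reads of GENE_A but is half as long, so the two end up with equal TPM.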

RNA-seq connects back to Module 3 in a direct way: it uses the same sequencing technology, but applied to RNA instead of DNA. The bioinformatics pipeline is also similar (FASTQ → alignment → quantification), but uses splice-aware aligners (STAR, HISAT2) that handle splice junctions that DNA aligners would miss.

---

Why gene expression matters for disease

Most of the genome variants associated with common complex diseases (heart disease, type 2 diabetes, schizophrenia) are not in protein-coding sequences. They're in regulatory regions — enhancers, silencers, promoters — where they alter gene expression levels rather than protein structure.

This means that to understand what a GWAS variant actually does, you often need expression data, not just sequence data. The field of expression quantitative trait loci (eQTL) analysis does exactly this: it identifies variants that are associated with changes in gene expression levels, linking genetic variation to regulatory function.
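
At its simplest, an eQTL test asks whether expression changes with the number of copies of a variant allele. The sketch below fits a least-squares slope of expression on genotype dosage (0, 1, or 2 alternate alleles) using invented data for six individuals. The `eqtl_slope` helper is hypothetical, and real eQTL mapping uses far more samples, covariates, and multiple-testing correction.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope: change in expression per extra alternate allele."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    covariance = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    variance = sum((g - mean_g) ** 2 for g in genotypes)
    return covariance / variance

# Hypothetical data: alternate-allele count and expression level per individual.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [10.0, 12.0, 15.0, 17.0, 20.0, 22.0]

slope = eqtl_slope(genotypes, expression)
print(slope)  # 5.0
```

A slope of 5 here means each additional copy of the alternate allele is associated with 5 more units of expression in this toy dataset.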

Large-scale efforts like GTEx (Genotype-Tissue Expression) have sampled 54 human tissue sites and mapped eQTLs in 49 of them (v8 release), making it possible to ask: does this GWAS variant affect the expression of a nearby gene, in which tissues, and by how much?

This is how modern genomics moves from "associated with disease" to "here is the mechanism."

---

Check yourself

1. A gene has 8 exons but produces five different protein isoforms. No mutations have been found in any patient with altered isoform ratios. What molecular mechanism most likely explains the isoform diversity? What would you investigate to understand changes in isoform ratios?

2. A GWAS for type 2 diabetes identifies a significant variant in a region with no annotated genes — it's 80,000 base pairs from the nearest gene. The variant is a single nucleotide change. Why is this variant still biologically interesting, and what experiment would you run to understand its function?

3. A researcher compares the transcriptomes of liver cells and neurons from the same individual. She finds thousands of differentially expressed genes. A colleague claims this means the two cell types have different genomes. You disagree. Explain the actual mechanism and what she should measure to understand why the expression profiles differ.

4. An RNA-seq experiment compares tumor vs. adjacent normal tissue. Gene X has 5,000 raw counts in the tumor and 4,500 in the normal. A second gene Y has 200 raw counts in tumor and 25 in normal. A student concludes Gene X is more differentially expressed. What is wrong with this reasoning?

---

Key facts to remember

Central dogma: DNA → RNA → Protein (with exceptions for retroviruses and non-coding RNA)
~20,000 protein-coding genes → 80,000–400,000 proteins, largely via alternative splicing
~1,600 transcription factors in the human genome; they are the primary drivers of cell-type-specific expression
Most GWAS variants fall in non-coding regulatory regions, not protein-coding sequences
Epigenetic marks (DNA methylation, histone modifications) regulate expression without changing sequence; they are cell-type specific and mitotically heritable
RNA-seq measures the complete transcriptome; scRNA-seq resolves expression at single-cell resolution
eQTL analysis links genetic variants to gene expression changes across tissues (GTEx sampled 54 tissue sites)

---

Primary sources & references

Crick, F. (1970). "Central dogma of molecular biology." Nature, 227, 561–563.
Pan, Q. et al. (2008). "Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing." Nature Genetics, 40, 1413–1415.
GTEx Consortium (2020). "The GTEx Consortium atlas of genetic regulatory effects across human tissues." Science, 369, 1318–1330.
Roadmap Epigenomics Consortium (2015). "Integrative analysis of 111 reference human epigenomes." Nature, 518, 317–330.
Schwartzman, O. & Tanay, A. (2015). "Single-cell epigenomics: techniques and emerging applications." Nature Reviews Genetics, 16, 716–726.
Buenrostro, J. D. et al. (2015). "ATAC-seq: A method for assaying chromatin accessibility genome-wide." Current Protocols in Molecular Biology, 109, 21.29.1–21.29.9.