Genomics | 1 of 13 | 35 min | Basic high school biology

What a genome actually is

Start here

You've probably heard "genome" used interchangeably with "DNA," "genetic code," and "blueprint of life." All three framings are imprecise in ways that will trip you up once you start doing real work. This module builds the accurate mental model — the one researchers use — before you touch any tool.

By the end of this module you should be able to answer:

What exactly is a genome, in precise terms?
Why is the "blueprint" metaphor wrong — and what's a better one?
What is the central dogma, and what are its exceptions?
What are genes, and why do they make up only about 2% of your genome?
What is the other 98% doing?

---

The blueprint metaphor is wrong

When the Human Genome Project announced an essentially complete human genome sequence in 2003, newspapers called it "the book of life" and "the blueprint of humanity." These metaphors stuck. They're also misleading enough that many molecular biologists avoid them.

Here's why "blueprint" doesn't work:

A blueprint is static. It describes a fixed structure. The genome's sequence is largely fixed, but the way it is read is not. The same genome in a liver cell and a neuron produces very different sets of proteins and very different behavior. A blueprint doesn't change its instructions depending on what room it's in. In effect, your genome does.

A blueprint is also deterministic. Given a blueprint, you can predict exactly what will be built. Given a genome, you cannot predict exactly what organism will develop. Identical twins have nearly identical genomes. They are not identical people. Environment, development, chance, and epigenetics all shape what actually happens.

A better metaphor — used by biologist Evelyn Fox Keller and others — is a recipe book written in a language nobody fully reads yet. Recipes are instructions, not fixed designs. The same recipe produces different results depending on who's cooking and what kitchen they're in. Recipes interact with each other. Some recipes are only active under certain conditions. And we have the recipe book in hand, but we're still figuring out what most of it says.

That metaphor isn't perfect either. But it captures something the blueprint metaphor misses: the genome is a set of instructions that are conditionally interpreted, not a fixed description of an outcome.

---

What a genome actually is

A genome is the complete set of DNA in one cell of an organism.

In humans, that means:

About 3.2 billion base pairs of DNA per copy of the genome (most cells carry two copies, one inherited from each parent)
Organized into 23 pairs of chromosomes (46 total)
Stored in the nucleus of almost every cell in your body
Plus a second, much smaller genome in your mitochondria (16,569 base pairs, inherited almost exclusively from your mother)

Base pairs are the fundamental unit. DNA is a double helix — two strands wound around each other. Each position on one strand pairs with a complementary base on the other: Adenine (A) pairs with Thymine (T), and Cytosine (C) pairs with Guanine (G). When genomicists describe genome size in "base pairs," they're counting those paired positions.
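
The pairing rule above is simple enough to express in a few lines. The following is a minimal Python sketch, not part of any tool in this track: the function names and the seven-base example sequence are made up for illustration. It also relies on one fact not stated above, that the two strands run in opposite directions, which is why the partner strand is usually written reversed.

```python
# Watson-Crick pairing rules from the paragraph above: A-T and C-G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base that pairs with each position, in the same order."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own direction (reversed)."""
    return complement(strand)[::-1]

seq = "GATTACA"  # made-up example sequence, not a real gene
print(complement(seq))          # CTAATGT
print(reverse_complement(seq))  # TGTAATC
```

Seven bases on one strand plus their seven partners make seven base pairs, which is the unit genome sizes are counted in.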

3.2 billion is hard to make intuitive. Two ways to feel it:

If you printed the human genome as text — one letter per base pair, no spaces — it would fill about 1.5 million pages. That's roughly 6,000 books at 250 pages each.

If you stretched the DNA from a single human cell end to end, it would reach about 2 meters. You have approximately 37 trillion cells. The total DNA in your body, fully extended, would reach the sun and back more than 60 times.
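
Both comparisons are plain arithmetic, so they can be checked directly. The sketch below uses only the figures quoted above plus one assumption that is not from the text: a mean Earth-Sun distance of about 1.496e11 meters. It also assumes every one of the 37 trillion cells carries a full 2 meters of DNA, which overcounts because some cells (red blood cells, for example) have no nucleus, so treat the result as an upper bound.

```python
GENOME_BP = 3.2e9        # base pairs, from the text
PAGES = 1.5e6            # printed pages, from the text
PAGES_PER_BOOK = 250     # from the text

letters_per_page = GENOME_BP / PAGES
books = PAGES / PAGES_PER_BOOK

DNA_PER_CELL_M = 2.0     # meters of DNA per cell, from the text
CELLS = 37e12            # cells in the body, from the text
EARTH_SUN_M = 1.496e11   # assumption: mean Earth-Sun distance in meters

total_m = DNA_PER_CELL_M * CELLS
round_trips = total_m / (2 * EARTH_SUN_M)

print(f"{letters_per_page:.0f} letters per page")  # 2133 letters per page
print(f"{books:.0f} books")                        # 6000 books
print(f"about {round_trips:.0f} round trips to the sun (upper bound)")
```

The naive upper bound comes out well above 60 round trips, so the "more than 60 times" figure holds even after a generous discount for cells without a nucleus.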

One more thing worth knowing: the human genome is not the largest genome known. The Paris japonica plant has a genome about 50 times the size of ours. Genome size does not correlate with organism complexity — this is called the C-value paradox, and it was one of the early clues that most DNA is not protein-coding.

---

The central dogma — and its exceptions

In 1958, Francis Crick proposed what he called the "central dogma of molecular biology":

DNA → RNA → Protein

Genetic information flows from DNA to RNA through transcription, and from RNA to protein through translation. Crick's own statement was stricter than the arrow diagram suggests: once sequence information has passed into protein, it cannot flow back into nucleic acid.

This is the core of how genomes do anything. The sequence of bases in DNA is transcribed into a messenger RNA (mRNA) molecule, which ribosomes then read to assemble a protein. Proteins do essentially everything: they form structures, catalyze reactions, carry signals, and regulate which other genes get expressed.
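
The two steps can be sketched in Python. This is an illustration under stated assumptions, not how any real tool does it: the function names and the 16-base example are hypothetical, the input is assumed to be the coding strand (so the mRNA is the same sequence with U in place of T), and translation simply starts at the first AUG and stops at the first stop codon. The codon table is the standard genetic code written in its usual compact form.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: one amino-acid letter per codon, codons listed in
# TCAG order (third base changing fastest). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna: str) -> str:
    """DNA -> mRNA, assuming the coding strand is given: swap T for U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """mRNA -> protein: start at the first AUG, read codons until a stop."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "GGATGGCCAAGTGATT"   # made-up sequence for illustration
mrna = transcribe(dna)
print(mrna)                # GGAUGGCCAAGUGAUU
print(translate(mrna))     # MAK
```

Real transcription and translation involve far more machinery (and real mRNAs are processed before translation), but the mapping from bases to codons to amino acids is exactly this lookup.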

The central dogma is correct in its essentials. But the simple three-step arrow is not the whole picture, and subsequent discoveries have added important nuance:

Reverse transcription: HIV and other retroviruses carry RNA as their genetic material and use an enzyme called reverse transcriptase to write that RNA back into DNA. Information can flow from RNA to DNA, a transfer Crick explicitly allowed for when he restated the dogma in 1970. HIV exploits this to permanently integrate its genome into the host cell's DNA.

RNA as a final product: Not all RNA becomes protein. MicroRNAs, long non-coding RNAs, and ribosomal RNAs are all transcribed from DNA but never translated. They perform essential regulatory and structural functions as RNA molecules directly.

Prions: Prion proteins can propagate their shape without any nucleic acid intermediary. A misfolded protein induces other copies to misfold. This passes conformational information from protein to protein without changing any sequence, so it falls outside the flows the central dogma describes. Prions cause diseases including Creutzfeldt-Jakob disease in humans and BSE (mad cow disease) in cattle.
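
Of the three exceptions, reverse transcription is the one that is still a sequence-to-sequence mapping, so it can be sketched the same way as base pairing. The function name and the six-base RNA below are hypothetical, and the sketch only produces the first complementary DNA strand, written in its own reading direction; the real enzyme's priming and second-strand synthesis are not modeled.

```python
# RNA base -> complementary DNA base
RNA_TO_DNA = {"A": "T", "U": "A", "C": "G", "G": "C"}

def reverse_transcribe(rna: str) -> str:
    """Return the DNA strand complementary to an RNA, reversed so it reads in its own direction."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna.upper()))

print(reverse_transcribe("AUGGCC"))  # GGCCAT
```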

These exceptions don't undermine the central dogma — they reveal that the genome is not the only repository of heritable information, and that understanding what a genome does requires understanding its relationship to a broader cellular context.

---

Genes: only 2% of the story

Here is a number that surprises almost everyone: protein-coding genes make up approximately 1.5–2% of the human genome.

The human genome contains roughly 19,000 to 20,000 protein-coding genes by current annotations; older estimates ran as high as 25,000, and the number is still being revised as annotation improves. Those genes, plus the regulatory sequences immediately around them, account for a small fraction of 3.2 billion base pairs.

For decades, the rest was widely dismissed as "junk DNA," a phrase popularized by geneticist Susumu Ohno in 1972. The assumption was that non-coding DNA was mostly evolutionary debris: broken genes, repetitive elements, and random sequence that had accumulated over millions of years of evolution without consequence.

That assumption was too sweeping. Here's what the other 98% actually contains:

Regulatory elements (~8–9% of the genome). These sequences control when, where, and how much genes are expressed:

Promoters: sequences just upstream of genes where the transcription machinery binds to initiate transcription
Enhancers: sequences that can sit thousands of base pairs away from the gene they regulate, but physically loop to the promoter to increase transcription
Silencers: sequences that suppress transcription
Insulators: sequences that block enhancers from acting on neighboring genes

Mutations in regulatory elements can cause disease just as mutations in coding sequences do. Disease-associated variants are enriched in regulatory regions, and many regulatory sequences are conserved across species, a sign that natural selection acts on them too.

Introns (~25–35% of the genome). Most human genes are interrupted by non-coding sequences called introns. When a gene is transcribed into pre-mRNA, the introns are spliced out and the coding exons are joined to form the mature mRNA. The intronic sequences are discarded.

Why do introns exist? Several reasons: they enable alternative splicing (one gene can produce multiple different protein variants depending on which exons are included), they contain regulatory elements, and they may have provided raw material for evolutionary innovation. But their full significance is still an active area of research.
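
Splicing and alternative splicing can be sketched as string slicing. Everything in this example is made up for illustration: the toy transcript, the exon coordinates, and the `splice` function itself. Exons are written in uppercase and introns in lowercase so the result is easy to check by eye.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices (0-based, end-exclusive) and drop everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy pre-mRNA: three exons (uppercase) separated by two introns (lowercase).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "AAGGAU" + "guaagccccag" + "CUGUAA"
exons = [(0, 6), (18, 24), (35, 41)]

# Constitutive splicing: keep all three exons.
print(splice(pre_mrna, exons))                 # AUGGCCAAGGAUCUGUAA

# Alternative splicing: skip the middle exon to get a shorter variant.
print(splice(pre_mrna, [exons[0], exons[2]]))  # AUGGCCCUGUAA
```

One gene, two different mature mRNAs: the only thing that changed is which exons were included.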

Repetitive elements (~50% of the genome). About half of the human genome is repetitive DNA:

SINEs (Short Interspersed Nuclear Elements): ~1 million copies of sequences like Alu elements, each about 300 bp. Alu elements alone account for about 10% of the entire human genome.
LINEs (Long Interspersed Nuclear Elements): ~500,000 copies, including LINE-1 elements that can occasionally still "copy and paste" themselves into new positions
Satellite DNA: highly repetitive sequences concentrated near centromeres and telomeres

Most of these are derived from ancient viruses and transposable elements that integrated into our ancestors' genomes millions of years ago and have been copied throughout the genome ever since. Approximately 8% of the human genome is derived from ancient retroviruses, which is more than the percentage that codes for proteins.

Non-coding RNA genes (~2–3% of the genome). These are genes transcribed into RNA but never translated into protein:

microRNAs (miRNAs): small RNAs (~22 nucleotides) that regulate gene expression by binding to mRNA and either preventing translation or triggering degradation
long non-coding RNAs (lncRNAs): transcripts longer than 200 nucleotides with no protein-coding function; involved in epigenetic regulation and nuclear organization, among other roles not yet fully understood
rRNA and tRNA: essential components of the translation machinery itself

The ENCODE project — a large international consortium — concluded in 2012 that approximately 80% of the genome has at least some biochemical function. That claim remains controversial because "biochemical activity" is not the same as "functional in the evolutionary sense." The debate over what fraction of the genome is genuinely functional is still active.

The honest answer to "how much of the genome is junk?" is: we don't know precisely. Less than was assumed. Possibly much less.
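
To get a feel for the scale of each category, the percentages quoted in this section can be converted into base pairs. The sketch below only restates those figures; it is not an independent estimate. One caveat matters when reading the totals: the categories overlap (introns, for instance, contain regulatory elements and repeats), so the ranges are not a clean partition of the genome and should not be expected to sum to exactly 100%.

```python
GENOME_BP = 3.2e9  # base pairs, from the text

# (low %, high %) for each category, copied from the section above.
CATEGORIES = {
    "protein-coding": (1.5, 2.0),
    "regulatory elements": (8.0, 9.0),
    "introns": (25.0, 35.0),
    "repetitive elements": (50.0, 50.0),
    "non-coding RNA genes": (2.0, 3.0),
}

for name, (low, high) in CATEGORIES.items():
    mid = (low + high) / 2                      # midpoint of the quoted range
    megabases = GENOME_BP * mid / 100 / 1e6     # convert percent to Mb
    print(f"{name:22s} ~{mid:5.2f}%  ~{megabases:,.0f} Mb")

total_low = sum(low for low, _ in CATEGORIES.values())
total_high = sum(high for _, high in CATEGORIES.values())
print(f"sum of quoted ranges: {total_low}% to {total_high}%")
```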

---

Getting the vocabulary straight

You'll encounter several related terms constantly. Here's what each actually refers to:

Genome: the complete DNA sequence of an organism. All 3.2 billion base pairs. Whole genome sequencing sequences all of it.

Exome: the protein-coding portions of the genome, meaning the exons that end up in mature mRNA and are translated into protein. In humans, approximately 1–2% of the genome, or ~45–50 million base pairs. Whole exome sequencing is cheaper than whole genome sequencing and captures most disease-causing coding variants, but it misses regulatory and intronic variants.

Transcriptome: the complete set of RNA molecules in a cell at a given time. Unlike the genome, which is essentially the same in every cell, the transcriptome varies dramatically by cell type and condition. A liver cell and a neuron have identical genomes but radically different transcriptomes. RNA sequencing (RNA-seq) measures the transcriptome.

Proteome: the complete set of proteins in a cell at a given time. Even more variable than the transcriptome. Proteomics is harder than genomics because proteins are more chemically diverse and there is no direct amplification method equivalent to PCR.

Epigenome: chemical modifications to DNA and histone proteins that affect gene expression without changing the DNA sequence. DNA methylation (adding a methyl group to cytosines, which typically silences nearby genes) and histone modifications are the primary mechanisms. The epigenome is heritable across cell divisions; in some cases marks also appear to pass between generations, though how much this happens in humans is still debated.

Understanding which level of biology a study is measuring — genomic, transcriptomic, proteomic, epigenomic — is essential for interpreting its conclusions correctly.

---

Why this matters before you touch a tool

Every tool in this track operates on a piece of this picture.

BLAST aligns DNA or protein sequences against a reference database. To use it correctly, you need to know whether you're working with DNA, RNA, or protein — and why those are different queries requiring different programs.
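
The "different programs" point can be made concrete. BLAST ships several programs, and which one applies depends on what the query is and what the database holds: `blastn` (nucleotide against nucleotide), `blastp` (protein against protein), `blastx` (nucleotide query translated and searched against protein), and `tblastn` (protein query against a translated nucleotide database). The Python sketch below is a hypothetical helper, not part of BLAST: it guesses the molecule type from the letters in a sequence and looks up the matching program. The guess is deliberately naive, since a short protein made up only of A, C, G, and T residues would be misread as DNA.

```python
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")

def guess_molecule(seq: str) -> str:
    """Naive guess at the molecule type from the alphabet used (illustrative only)."""
    letters = set(seq.upper())
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    return "protein"

# (query type, database type) -> BLAST program name
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

def choose_program(seq: str, database: str) -> str:
    """Pick a BLAST program for a query sequence and a database type."""
    molecule = guess_molecule(seq)
    query = "protein" if molecule == "protein" else "nucleotide"
    return PROGRAMS[(query, database)]

print(choose_program("ATGGCCAAGTGA", "nucleotide"))  # blastn
print(choose_program("ATGGCCAAGTGA", "protein"))     # blastx
print(choose_program("MAKLVW", "protein"))           # blastp
```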

ClinVar archives reported relationships between genetic variants and health conditions, including each variant's clinical-significance classification. To interpret a result, you need to understand what kind of variant it is and where in the gene it falls.

gnomAD reports allele frequencies across populations. To read those frequencies correctly, you need to understand why frequency differs by ancestry — which requires understanding population genetics.
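
An allele frequency is a ratio: the number of copies of the variant allele observed (allele count, AC) divided by the total number of alleles genotyped at that position (allele number, AN). The sketch below shows the calculation with invented counts for two made-up groups. None of these numbers come from gnomAD, and the function name is hypothetical.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """AF = AC / AN: variant allele copies over all alleles genotyped."""
    if allele_number == 0:
        raise ValueError("no genotyped alleles at this position")
    return allele_count / allele_number

# Invented counts for one variant in two hypothetical groups: (AC, AN).
counts = {"group_a": (120, 20000), "group_b": (3, 30000)}

for group, (ac, an) in counts.items():
    print(f"{group}: AF = {allele_frequency(ac, an):.4f}")
# group_a: AF = 0.0060
# group_b: AF = 0.0001
```

The same variant is sixty times more common in one invented group than in the other, which is the kind of difference that makes a single pooled frequency misleading.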

The Ensembl genome browser displays everything simultaneously: genes, regulatory elements, variants, conservation across species. To navigate it without getting lost, you need to know what each layer represents.

None of that is possible if your mental model of a genome is "the blueprint of life." All of it becomes tractable if your model is what you've read here: a 3.2-billion-base-pair dynamic instruction set, 98% of which isn't protein-coding, interpreted differently in every cell type, partially derived from ancient viruses, shaped by billions of years of evolution, and still only partially understood.

---

Key Terms

Term	Definition
Base pair (bp)	A paired unit of nucleotides on opposite DNA strands (A-T or C-G)
Central dogma	Genetic information flows DNA → RNA → Protein
Gene	A DNA sequence encoding a functional RNA or protein product
Exon	Portion of a gene retained in mature mRNA after splicing
Intron	Non-coding sequences within a gene, spliced out before translation
Regulatory element	Non-coding sequences (promoters, enhancers, silencers) controlling gene expression
Transposable element	DNA sequences that can change position within a genome; they and their remnants make up close to half of the human genome
Genome	Complete DNA content of one cell
Exome	Protein-coding portion of the genome (~1–2% of the human genome)
Transcriptome	Complete set of RNA transcripts in a cell at a given time
Epigenome	Chemical modifications to DNA and histones that regulate expression without altering sequence
C-value paradox	Genome size does not correlate with organism complexity

---

Check Your Understanding

Answer these without looking back — then revisit any you're unsure of.

What does "3.2 billion base pairs" mean in physical terms?
Why is the "blueprint" metaphor for the genome misleading?
State the central dogma and name one exception to it.
Protein-coding genes make up approximately what fraction of the human genome?
Name three categories of functional non-coding DNA and describe what each does.
What is the difference between the genome and the transcriptome, and why does that difference matter?

---

Where this takes you

If the genome contains 3.2 billion base pairs, and any two humans share about 99.9% of that sequence, what is the 0.1% that differs — and what does it do?

That 0.1% is where genomic medicine lives.

→ Module 2: How genomes differ — variants explained from scratch

---

Primary sources:

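ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74. Open access.
Crick, F. (1970). Central dogma of molecular biology. Nature, 227, 561–563.
International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860–921. Open access.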