Genomics | Module 2 of 13 | 40 min | Prerequisite: Module 1 (you know what a genome is and roughly what genes do)

How genomes differ — variants explained from scratch

Start here

In Module 1 you learned that any two humans share about 99.9% of their genome sequence. That number is worth sitting with for a moment. 99.9% identical sounds like almost nothing differs. But 0.1% of 3.2 billion base pairs is 3.2 million positions where two people's genomes can differ. That's a lot of variation.
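
The arithmetic behind that number can be checked in a couple of lines. This is only a sketch using the rounded figures from the paragraph above; the constant and function names are made up for illustration.

```python
# Rounded figures taken from the text above.
GENOME_SIZE_BP = 3_200_000_000   # about 3.2 billion base pairs
DIFFERENCE_PER_THOUSAND = 1      # 0.1% expressed as 1 in 1,000

def differing_positions(genome_size_bp: int, per_thousand: int) -> int:
    """Number of positions expected to differ, using integer arithmetic."""
    return genome_size_bp * per_thousand // 1000

positions = differing_positions(GENOME_SIZE_BP, DIFFERENCE_PER_THOUSAND)
print(f"{positions:,} positions")
```

Integer arithmetic is used so that the result is exact rather than subject to floating-point rounding.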

This module is about what that variation actually looks like — the types, sizes, and mechanisms of genomic difference — and crucially, how scientists classify whether a given variant causes disease, has no effect, or lands somewhere uncertain in between.

By the end of this module you should be able to answer:

  • What is the difference between a "variant" and a "mutation"?
  • What are SNPs, indels, CNVs, and structural variants?
  • What is the difference between germline and somatic variants?
  • How does a variant actually cause disease?
  • What do "pathogenic," "benign," and "variant of uncertain significance" mean, and why is VUS such a problem?

---

"Variant" vs "mutation" — why the word choice matters

In older genetics literature, changes in DNA sequence were almost always called "mutations." In contemporary genomics, the preferred term is variant.

Figure: Four scales of genomic variation. A reference sequence (ATGCATGGCA) is shown with four variant types beneath it: a SNP swaps one base (1 bp, ATGCATAGCA), an indel inserts or deletes a base (1 to a few bp, ATGCAATGGCA), a copy number variant duplicates a segment (kb to Mb, ATGCATGATGGCA), and a structural variant inverts a segment (≥50 bp, ATGGTACGCA). Variation spans a single base pair to millions of base pairs.

This isn't just political correctness. It reflects a genuine conceptual shift.

"Mutation" carries an implicit connotation of abnormality — it suggests a deviation from a correct baseline. But there is no single "correct" human genome. Every person's genome differs from every other person's at millions of positions. Most of that variation is neutral or beneficial. Calling all of it "mutation" frames normal human diversity as pathology.

"Variant" is neutral. It means simply: a position in the genome where the sequence differs from the reference. Whether that difference is harmful, beneficial, or meaningless is a separate question.

You'll still encounter "mutation" in the literature, especially in cancer biology and older papers. In cancer contexts, "somatic mutation" remains common and appropriate because it refers specifically to acquired changes that drive tumor growth. But in clinical genetics, the shift toward "variant" is essentially complete.

The practical consequence: when you see the word "variant" in a ClinVar record or a clinical genetics report, it does not mean the person has a disease. It means a difference from the reference was found. What that difference means is the actual question.

---

SNPs: the most common type of variation

A SNP — pronounced "snip" — stands for Single Nucleotide Polymorphism.

A SNP is a position in the genome where a single base pair differs between individuals. At a given position, most people might have a G, but some people have an A instead. That substitution is a SNP.

Formally, a position is classified as a SNP if the less common variant — called the minor allele — has a frequency of at least 1% in at least one population. If the variant is rarer than 1%, it's usually called a rare variant or simply a variant, not a SNP. This threshold is somewhat arbitrary but useful for distinguishing common population-level variation from rare potentially disease-causing changes.
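
The 1% rule can be written as a small check. The sketch below assumes you already have the alternative allele's frequency in each population; the population labels and frequencies are invented for illustration and do not describe any real variant.

```python
SNP_MAF_THRESHOLD = 0.01  # the 1% convention described above

def minor_allele_frequency(alt_allele_frequency: float) -> float:
    """Frequency of the less common of the two alleles at a position."""
    return min(alt_allele_frequency, 1.0 - alt_allele_frequency)

def is_snp(alt_freq_by_population: dict[str, float]) -> bool:
    """True if the minor allele reaches 1% in at least one population."""
    return any(
        minor_allele_frequency(freq) >= SNP_MAF_THRESHOLD
        for freq in alt_freq_by_population.values()
    )

# Hypothetical frequencies, invented for illustration only.
common_somewhere = {"population_A": 0.002, "population_B": 0.04}
rare_everywhere = {"population_A": 0.0005, "population_B": 0.003}

print(is_snp(common_somewhere))  # reaches 1% in population_B
print(is_snp(rare_everywhere))   # below 1% everywhere
```

Note that the answer depends on which populations have been sampled, a point that becomes important later in this module.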

Current databases catalogue roughly 600–700 million single-nucleotide variant positions in the human genome. Most of these are rare and fall well below the 1% threshold, so only a small fraction are SNPs in the strict sense defined above. Any two unrelated people differ at roughly 4–5 million positions, most of which are SNPs.

Synonymous vs nonsynonymous SNPs

Not all SNPs have the same potential for consequence. Whether a SNP matters depends partly on where it falls and what it changes:

A synonymous SNP (also called a "silent" SNP) changes the DNA sequence but not the amino acid it encodes. This is possible because the genetic code is redundant: multiple different codons (three-base sequences) encode the same amino acid. For example, both GGA and GGG encode glycine — a mutation from GGA to GGG changes the DNA sequence but produces the same protein. Synonymous SNPs were long considered definitively neutral, but research has shown they can affect translation speed, RNA stability, and splicing — so "silent" is not always accurate.

A nonsynonymous SNP (also called a "missense" SNP) changes the DNA sequence and changes the resulting amino acid. A change from GGA (glycine) to GAA (glutamic acid) at the same position is missense. Whether this matters depends on whether that amino acid is functionally important.

A nonsense SNP creates a premature stop codon — a signal that halts translation early, producing a truncated protein. These are more likely to be deleterious than missense variants.

A splice site SNP occurs at the boundary between an exon and an intron. Splicing is directed by specific sequences at these boundaries. A SNP that disrupts a splice site can cause intron retention, exon skipping, or other splicing errors — all of which alter the final protein.
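
The first three categories can be told apart by looking up the codon before and after the change. The sketch below builds the standard genetic code and classifies a single-codon change; the function name and labels are illustrative, and splice site variants are deliberately left out because they depend on position relative to exon boundaries, not on the codon table.

```python
from itertools import product

# Standard genetic code, DNA alphabet, bases ordered T, C, A, G. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a coding change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

print(classify_codon_change("GGA", "GGG"))  # glycine stays glycine
print(classify_codon_change("GGA", "GAA"))  # glycine becomes glutamic acid
print(classify_codon_change("GGA", "TGA"))  # glycine becomes a stop codon
```

A label from this function says nothing about clinical impact on its own; as noted above, even a synonymous change can affect splicing or translation.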

---

Indels: insertions and deletions

An indel is a variant where one or more base pairs are inserted into or deleted from the genome.

Figure: Why a frameshift is catastrophic. Normal mRNA is read in triplets as AUG (Met), GCA (Ala), UCA (Ser), GGU (Gly). Inserting a single C after AUG shifts the reading frame to AUG (Met), CGC (Arg), AUC (Ile), AGG (Arg), U…, so every codon after the insertion is misread, leading to a premature stop downstream.

Small indels — one to a few base pairs — are the second most common type of variation after SNPs. Larger indels shade into the territory of structural variants (covered below).

The critical concept with indels in protein-coding sequence is the reading frame.

Proteins are encoded by codons: three-base sequences that each specify one amino acid. The ribosome reads mRNA three bases at a time, in frame. If you insert or delete a number of bases that is not a multiple of three, you shift the reading frame — every codon downstream of the indel is now different. The resulting protein has a completely altered amino acid sequence from the point of the indel onward and typically terminates prematurely at a new stop codon.

This is called a frameshift mutation, and it is almost always severely disruptive to protein function.

An indel of exactly 3 base pairs (or any multiple of 3) is in-frame — it adds or removes one or more amino acids without disrupting the reading frame. In-frame indels can be tolerated if they don't fall in a functionally critical region, or catastrophic if they do.
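
The multiple-of-three rule is easy to see in code. The sketch below uses a short made-up mRNA and a lookup table containing only the codons this example needs (it is not a full genetic code); the function names are illustrative.

```python
# Only the codons needed for this example, not the full genetic code.
CODONS = {
    "AUG": "Met", "GCA": "Ala", "UCA": "Ser", "GGU": "Gly",
    "CGC": "Arg", "AUC": "Ile", "AGG": "Arg",
}

def is_frameshift(indel_length_bp: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length_bp % 3 != 0

def translate(mrna: str) -> list[str]:
    """Read complete codons from the start; leftover bases at the end are ignored."""
    usable_length = len(mrna) - len(mrna) % 3
    return [CODONS[mrna[i:i + 3]] for i in range(0, usable_length, 3)]

normal = "AUGGCAUCAGGU"
one_base_insertion = normal[:3] + "C" + normal[3:]  # insert C after the first codon
three_base_deletion = normal[:3] + normal[6:]       # remove the GCA codon

print(translate(normal))               # Met, Ala, Ser, Gly
print(translate(one_base_insertion))   # every codon after the insertion changes
print(translate(three_base_deletion))  # one amino acid missing, the rest intact
```

The one-base insertion rewrites everything downstream, while the three-base deletion removes a single amino acid and leaves the rest of the sequence as it was.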

The most common mutation causing cystic fibrosis — F508del, the variant you'll see in gnomAD in Module 10 — is an in-frame 3-base-pair deletion in the CFTR gene that removes a single phenylalanine at position 508. This single amino acid loss prevents the CFTR protein from folding correctly. Nearly 70% of cystic fibrosis cases in people of Northern European ancestry are caused by this one variant.

---

Copy number variants: more or fewer copies

Copy number variants (CNVs) are regions of the genome where the number of copies differs from the standard two (one per chromosome in a diploid organism).

Normally, you have two copies of every autosomal gene — one on each chromosome in a homologous pair. A CNV means some people have one copy (a deletion), three copies (a duplication), or even more.

CNVs range from a few hundred base pairs to several megabases in size. They collectively account for more total base pairs of difference between individuals than SNPs do, even though SNPs are more numerous.

CNVs are clinically important:

  • Deletions can cause disease by reducing the dosage of a gene below what is functional. DiGeorge syndrome (also called 22q11.2 deletion syndrome) is caused by a deletion of roughly 3 million base pairs on chromosome 22, removing about 30–40 genes. It causes heart defects, immune deficiency, and developmental delays.
  • Duplications can cause disease by increasing gene dosage beyond what is tolerable. Charcot-Marie-Tooth disease type 1A is caused by a duplication of the PMP22 gene — one extra copy of this gene, producing too much PMP22 protein, damages the myelin sheath around peripheral nerves.
  • Many CNVs are benign — copy number variation in certain genomic regions is common and well tolerated.

CNVs are not reliably captured by standard small-variant calling, which is designed to find SNPs and small indels. Detecting them requires either array-based methods (SNP arrays or CGH arrays) or analysis of read depth in whole genome sequencing data, which is a different analytical approach.
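
The read-depth idea can be sketched very simply: if a region is covered by half as many reads as a typical two-copy region, it probably has one copy; if it is covered by 1.5 times as many, it probably has three. The code below is a toy illustration under that assumption only. Real CNV callers also correct for GC content, mappability, and noise, and the depth values here are invented.

```python
def estimate_copy_number(region_depth: float, diploid_baseline_depth: float) -> int:
    """Estimate copies from read depth, assuming depth scales linearly with copy number."""
    if diploid_baseline_depth <= 0:
        raise ValueError("baseline depth must be positive")
    return round(2 * region_depth / diploid_baseline_depth)

def describe_copy_number(copy_number: int) -> str:
    """Label an autosomal copy number relative to the standard two copies."""
    if copy_number < 2:
        return "deletion"
    if copy_number > 2:
        return "duplication"
    return "normal"

# Hypothetical depths: a 30x genome with three regions of interest.
for depth in (15.0, 30.0, 45.0):
    copies = estimate_copy_number(depth, diploid_baseline_depth=30.0)
    print(depth, copies, describe_copy_number(copies))
```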

---

Structural variants: the large-scale rearrangements

Structural variants (SVs) are large-scale changes to genome organization — typically defined as variants involving at least 50 base pairs. They include:

Inversions — a segment of a chromosome is reversed in orientation. The DNA sequence is the same, but it reads in the opposite direction. Inversions can disrupt genes that span the inversion breakpoint and can affect regulatory elements. Some inversions are common polymorphisms with no apparent effect; others cause disease.

Translocations — a segment of one chromosome moves to a non-homologous chromosome. A reciprocal translocation involves exchange of segments between two chromosomes. A Robertsonian translocation involves fusion of two acrocentric chromosomes (chromosomes 13, 14, 15, 21, or 22) at their centromeres. Translocation carriers are often healthy themselves but have a higher risk of chromosome imbalance in offspring — Robertsonian translocations involving chromosome 21 are a common cause of hereditary Down syndrome.

The Philadelphia chromosome — the most famous translocation in medicine — is a reciprocal translocation between chromosomes 9 and 22 that creates the BCR-ABL fusion gene. This fusion produces a constitutively active tyrosine kinase that drives chronic myeloid leukemia (CML). It is the direct target of imatinib (Gleevec), one of the first molecularly targeted cancer drugs, which inhibits the BCR-ABL kinase specifically.

Complex rearrangements — some structural variants involve multiple simultaneous breaks and rearrangements across the genome. Chromothripsis ("chromosome shattering") describes events where a chromosome shatters into many pieces and is reassembled incorrectly. Initially thought to be rare, chromothripsis is now recognized in a significant fraction of cancer genomes.

---

Germline vs somatic variants — a distinction that changes everything

Every variant type described above can be either germline or somatic. This distinction is one of the most important in clinical genomics.

Figure: Germline vs somatic variants. Two cell-lineage trees. On the left, a germline variant is present from the fertilized egg, so every cell in the body carries it; it is heritable and detectable in a blood test. On the right, a somatic variant appears in a single cell later in life, so only that cell's descendants (a tumor) carry it, and it cannot be inherited.

Germline variants are present in every cell of the body. They are inherited from one or both parents (or arise as new mutations in the egg or sperm — these are called de novo variants). Because germline variants are in the original DNA you were born with, they are present in every cell derived from that original cell — which means virtually every cell in your body.

When a clinical genetics lab sequences a patient's DNA (typically from blood), they are sequencing germline DNA. A pathogenic germline variant in BRCA1 means that person has an elevated lifetime risk of breast and ovarian cancer because that variant is in every cell, including their breast and ovarian epithelial cells.

Germline variants are heritable. If you carry a pathogenic BRCA1 variant, each of your biological children has a 50% chance of inheriting it.

Somatic variants arise in individual cells after conception, as a result of DNA replication errors, environmental damage (UV, carcinogens), or enzymatic activity. They are present only in the cell where they arose and its descendants — not in every cell of the body.

Cancer is fundamentally a disease of somatic variants accumulating in specific cells. A lung tumor acquires somatic variants that give those cells a growth advantage. Those variants are not present in the blood, the liver, or other tissues. They cannot be inherited.

This distinction has profound consequences for:

  • Testing: germline variants are detected from blood or saliva; somatic variants require tissue from the affected organ (a tumor biopsy)
  • Risk: a germline variant confers heritable risk; a somatic variant does not
  • Interpretation: a variant that is rare and disease-causing in the germline might be a common somatic event in cancer
  • Treatment: targeted cancer therapies often work by inhibiting the product of a specific somatic variant — imatinib targeting BCR-ABL is the textbook example
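
The testing point above suggests a simple comparison: a variant seen in both blood and tumor is germline, while a variant seen only in the tumor is somatic. The sketch below shows that logic with set operations. The variant coordinates are placeholders invented for illustration, and real somatic callers are statistical rather than a plain set difference.

```python
# A variant is identified here by (chromosome, position, reference base, alternate base).
Variant = tuple[str, int, str, str]

def split_germline_somatic(
    blood_calls: set[Variant], tumor_calls: set[Variant]
) -> dict[str, set[Variant]]:
    """Variants found in blood are treated as germline; tumor-only variants as somatic."""
    return {
        "germline": set(blood_calls),
        "somatic": tumor_calls - blood_calls,
    }

# Placeholder coordinates, invented for illustration only.
blood = {("chr17", 1000, "G", "A")}
tumor = {("chr17", 1000, "G", "A"), ("chr17", 2500, "C", "T")}

result = split_germline_somatic(blood, tumor)
print(result["germline"])  # present in every cell, including the tumor
print(result["somatic"])   # acquired in the tumor lineage only
```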

Mosaicism — a middle case — occurs when a de novo variant arises early in embryonic development, after the single-cell stage but before complete differentiation. The result is a person whose body contains two cell populations with different genomes. The proportion of cells carrying the variant depends on how early it arose. Mosaicism can affect somatic tissues, germline tissue, or both, and creates unusual clinical presentations and inheritance patterns.
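
To see why timing matters, consider a deliberately idealized model: every cell divides in step and all lineages contribute equally to the body. Under those assumptions (which real embryos do not strictly follow), a variant arising in one cell after k rounds of division ends up in about 1 in 2^k cells. The function name is illustrative.

```python
from fractions import Fraction

def expected_mosaic_fraction(divisions_before_variant: int) -> Fraction:
    """Idealized fraction of cells carrying a variant that arose in a single cell
    after the given number of synchronous divisions of the fertilized egg."""
    if divisions_before_variant < 0:
        raise ValueError("number of divisions cannot be negative")
    return Fraction(1, 2 ** divisions_before_variant)

for divisions in (0, 1, 3, 10):
    print(divisions, expected_mosaic_fraction(divisions))
```

With zero divisions the variant is present from the single-cell stage and behaves like a germline variant; the later it arises, the smaller the affected share of cells.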

---

How variants cause disease

Having a variant in a gene does not automatically mean the variant causes disease. Understanding the mechanisms by which variants disrupt gene function is essential for interpreting clinical genetics results.

Loss of function (LoF)

A loss-of-function variant reduces or eliminates the function of the gene product. Common LoF mechanisms include:

  • Nonsense variants (premature stop codon) → truncated protein
  • Frameshift indels → altered sequence and premature stop
  • Splice site variants → disrupted mRNA
  • Large deletions → gene absent entirely

Whether LoF causes disease depends on whether one functional copy is sufficient.

Haploinsufficiency occurs when one functional copy of a gene is not enough — the organism requires two functional copies for normal development or physiology. Haploinsufficiency is the mechanism underlying many dominant genetic diseases. BRCA1 haploinsufficiency increases cancer risk; having one nonfunctional copy leaves breast cells with reduced DNA repair capacity.

Recessive diseases require both copies to be nonfunctional. If you have one functional copy and one LoF copy, you are a carrier — typically healthy, but at risk of having affected children if your partner is also a carrier. Cystic fibrosis, sickle cell disease, and most inborn errors of metabolism are recessive.
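
The carrier risk can be worked out by listing every combination of alleles the parents can pass on, which is what a Punnett square does. In the sketch below, "A" stands for a functional copy and "a" for a loss-of-function copy; these labels and the function name are illustrative.

```python
from collections import Counter
from itertools import product

def offspring_genotypes(parent_one: str, parent_two: str) -> dict[str, float]:
    """Probability of each child genotype, assuming each parent passes on
    one of their two alleles with equal chance."""
    counts = Counter(
        "".join(sorted(pair)) for pair in product(parent_one, parent_two)
    )
    total = sum(counts.values())
    return {genotype: count / total for genotype, count in counts.items()}

# Two carriers of a recessive condition.
carrier_cross = offspring_genotypes("Aa", "Aa")
print(carrier_cross)

# One parent heterozygous, the other with two reference copies.
one_carrier = offspring_genotypes("Aa", "AA")
print(one_carrier)
```

For two carriers, each child has a 25% chance of inheriting two nonfunctional copies ("aa"), a 50% chance of being a carrier, and a 25% chance of inheriting two functional copies. The second cross shows the 50% transmission chance for a single heterozygous variant.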

Gain of function (GoF)

A gain-of-function variant creates a new or enhanced activity that the normal protein doesn't have. This is less common than LoF but often causes dominant disease because even one copy of the variant allele has an effect.

The RAS oncogenes are classic gain-of-function disease genes in cancer. A somatic missense variant in KRAS at position 12 locks the RAS protein in an active, GTP-bound state — it sends growth signals continuously rather than transiently. This drives cell proliferation. KRAS mutations are present in ~25% of all human cancers.

Dominant negative

Some variants produce a protein that not only loses its own function but actively interferes with the function of the normal protein produced by the other allele. This is called a dominant negative effect. Many transcription factors and structural proteins work as multimers — they function as complexes of multiple identical or related protein subunits. A mutant subunit that can still assemble into the complex but disrupts its activity can impair the entire complex, even when a normal copy is present.

Collagen disorders including osteogenesis imperfecta (brittle bone disease) often work through a dominant negative mechanism: one mutant collagen chain incorporated into a triple helix disrupts the entire helix.

---

The five-tier classification system

When a clinical genomics lab finds a variant, they need to communicate what they think it means. The American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published guidelines in 2015 that established a five-tier classification system now used universally in clinical genetics:

Figure: The five-tier ACMG variant classification. A spectrum from harmless to harmful: Benign (not disease-causing), Likely benign (probably harmless), VUS (uncertain, can't classify), Likely pathogenic (probably causal), Pathogenic (disease-causing). Arrows show that a VUS can be reclassified toward either end as evidence accumulates. VUS is clinical limbo: not a diagnosis, not an all-clear. 20–40% of rare variants land here, and patients from underrepresented populations receive VUS results far more often.

Pathogenic — the evidence strongly supports that this variant causes the indicated disease. A pathogenic classification means the lab is confident the variant is disease-causing.

Likely Pathogenic — the evidence supports pathogenicity but is not definitive. "Likely" in clinical genetics has a specific meaning: at least 90% probability of pathogenicity.

Variant of Uncertain Significance (VUS) — the evidence is insufficient or conflicting to classify the variant as pathogenic or benign. This is not a diagnosis and not a clean bill of health. It is a statement of uncertainty.

Likely Benign — the evidence supports that this variant does not cause the indicated disease.

Benign — the evidence strongly supports that this variant is not disease-causing.

The ACMG/AMP criteria weigh multiple lines of evidence, each assigned a strength level (from supporting to very strong) and combined according to defined rules: population frequency (is the variant common in healthy populations?), functional studies (has the variant been shown to disrupt protein function in a lab assay?), computational predictions (do multiple algorithms predict the variant is damaging?), segregation data (does the variant co-segregate with disease in affected families?), and clinical reports (has this variant been seen in multiple unrelated affected individuals?).

---

The VUS problem

VUS — Variant of Uncertain Significance — deserves its own section because it is one of the most significant unsolved problems in clinical genomics, and one of the most distressing clinical situations in genetic medicine.

Consider what it means to receive a genetic test result that says "VUS":

You had testing because you or your family has a history of a disease — say, hereditary breast and ovarian cancer (HBOC). The test found a variant in BRCA1. The variant is classified as a VUS.

What do you do with that information?

You can't have a preventive mastectomy based on a VUS — the evidence doesn't support it. You can't be reassured that you're at normal risk — the evidence doesn't support that either. You're in clinical limbo. Surveillance recommendations for VUS carriers are uncertain. Insurance coverage for enhanced screening may be denied because there's no confirmed diagnosis.

Why are there so many VUS results?

Every person's exome contains roughly 12,000–15,000 variants relative to the reference genome when coding regions are analyzed. The vast majority are common variants seen in many people and classified as benign. But a significant fraction — often 20–40% of rare variants in clinical testing — cannot be classified with confidence.

This is a data problem. To classify a variant, you need:

  • Evidence that it causes functional disruption (lab studies, which are expensive and slow)
  • Evidence that it's enriched in affected individuals (large case-control datasets)
  • Evidence from family segregation studies (relatives available and willing to test)

For the rarest variants — never seen before, in populations with less research representation — none of this evidence exists. The variant is novel. It lands in the VUS category by default.

Who carries more VUS?

Because most genetic research has been conducted in people of European ancestry, databases of variant classifications — ClinVar, databases used for population frequency assessment — are overwhelmingly derived from European populations. A variant that is rare in the European population may be common and well-established as benign in a West African or East Asian population — but if those populations are underrepresented in databases, the variant gets classified as VUS.
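
A toy example makes the database problem concrete. The sketch below treats a variant as having benign frequency evidence only if it is common in a population the database actually sampled. The cutoff value, population labels, and frequencies are all assumptions chosen for illustration, not figures from any real database.

```python
# Assumption for illustration: a variant this common in any sampled
# population is treated as too common to cause a rare disease.
ASSUMED_COMMON_CUTOFF = 0.05

def frequency_evidence(
    true_freq_by_population: dict[str, float], sampled_populations: set[str]
) -> str:
    """Report what a database can conclude from the populations it contains."""
    observed = [
        freq
        for population, freq in true_freq_by_population.items()
        if population in sampled_populations
    ]
    if not observed:
        return "no frequency data"
    if max(observed) > ASSUMED_COMMON_CUTOFF:
        return "common in a sampled population"
    return "rare in every sampled population"

# Hypothetical variant: rare in one population, common in another.
variant_freqs = {"population_A": 0.0001, "population_B": 0.08}

print(frequency_evidence(variant_freqs, {"population_A"}))
print(frequency_evidence(variant_freqs, {"population_A", "population_B"}))
```

The variant itself is identical in both calls. Only the database changes, and with it the evidence available to classify the variant.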

A 2019 study in the New England Journal of Medicine demonstrated this directly: variants in BRCA1 and BRCA2 were classified as VUS at significantly higher rates in Black and Hispanic women than in white women — not because those variants were more likely to be harmful, but because the databases used to classify them contained less data from those populations.

This is one of the most concrete, documentable ways that genomic medicine perpetuates health disparities. The solution is more diverse research participation and more diverse population databases — which is why initiatives like the All of Us Research Program (enrolling one million participants with an explicit diversity mandate) matter.

VUS reclassification

VUS classifications are not permanent. As evidence accumulates (more affected individuals tested, more functional studies published, more families analyzed), a VUS can be reclassified up (toward pathogenic) or down (toward benign). Laboratories are expected to maintain variant classifications and issue updated reports when new evidence supports reclassification.

This means a clinical genetic test result from five years ago may have a different interpretation today. Laboratories typically recommend periodic re-review of VUS results.

---

Zygosity: homo-, hetero-, and hemi-

One more term you'll encounter constantly in variant databases: zygosity describes how many copies of a variant an individual carries.

Heterozygous: the variant is present on one chromosome of a pair, but the other chromosome has the reference sequence. Represented as one copy.

Homozygous: the variant is present on both chromosomes of the pair — both copies carry the same variant. This typically happens when both parents are carriers of the same recessive variant.

Compound heterozygous: two different variants are present in the same gene, one on each chromosome. Both copies of the gene are non-functional, but through different variants. Compound heterozygosity is a common mechanism in recessive disease.

Hemizygous: only one copy of the gene exists, so there's only one allele to consider. This applies to genes on the X chromosome in males (who have only one X) and to genes in regions where one copy has been deleted. For X-linked genes, a male who is hemizygous for a pathogenic variant will be affected, even if the disease would be recessive in a female who has two X chromosomes.
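
These definitions reduce to counting copies. The sketch below is one hypothetical way to express them: it takes the number of copies carrying the variant and the number of copies of the gene the person has. The function names and the "maternal"/"paternal" labels are illustrative and do not come from any particular database format.

```python
def zygosity(variant_copies: int, gene_copies: int = 2) -> str:
    """Name the zygosity from how many of the available gene copies carry the variant."""
    if gene_copies not in (1, 2) or not 0 <= variant_copies <= gene_copies:
        raise ValueError("expected 1 or 2 gene copies and a matching variant count")
    if variant_copies == 0:
        return "reference"
    if gene_copies == 1:
        return "hemizygous"
    return "heterozygous" if variant_copies == 1 else "homozygous"

def is_compound_heterozygous(first_variant_chromosome: str,
                             second_variant_chromosome: str) -> bool:
    """Two different heterozygous variants in one gene are compound heterozygous
    only if they sit on opposite chromosomes of the pair."""
    return first_variant_chromosome != second_variant_chromosome

print(zygosity(1))                 # one of two copies carries the variant
print(zygosity(2))                 # both copies carry the variant
print(zygosity(1, gene_copies=1))  # e.g. an X-linked gene in a male
print(is_compound_heterozygous("maternal", "paternal"))
```

The second function highlights why phase matters: two variants on the same chromosome leave the other copy of the gene intact, so the person still has one working copy.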

---

Key Terms

| Term | Definition |
| --- | --- |
| Variant | A difference from the reference sequence; neutral in implication |
| SNP | Single Nucleotide Polymorphism; a position where a single base differs; minor allele frequency ≥1% |
| Indel | Insertion or deletion of one or more base pairs |
| Frameshift | An indel that is not a multiple of 3 bp, disrupting the reading frame |
| CNV | Copy Number Variant; deletion or duplication of a genomic segment |
| Structural variant | Large-scale rearrangement (inversion, translocation, etc.); typically ≥50 bp |
| Germline variant | Present in every cell; inherited or de novo; heritable |
| Somatic variant | Acquired in a specific cell after conception; present only in descendants of that cell; not heritable |
| De novo variant | Arises new in an individual; not present in either parent |
| Haploinsufficiency | One functional copy of a gene is insufficient; LoF variants are dominant |
| Loss of function | Variant reduces or eliminates gene product activity |
| Gain of function | Variant creates or enhances an activity not present in the normal protein |
| Dominant negative | Variant protein interferes with the normal protein from the other allele |
| Pathogenic | Strong evidence supports disease causation |
| VUS | Variant of Uncertain Significance; insufficient evidence to classify |
| Heterozygous | Variant present on one chromosome; reference on the other |
| Homozygous | Variant present on both chromosomes |
| Hemizygous | Only one copy of the gene exists (X-linked in males; deletions) |

---

Check Your Understanding

Answer these without looking back — then revisit any you're unsure of.

  1. What is the difference between a variant and a mutation — and why does the word choice matter?
  2. What makes an indel a frameshift variant — and why is this significant?
  3. A woman has a BRCA1 variant in her blood test. Her tumor biopsy also has a BRCA1 variant, but it's different from the one in her blood. How do you explain this?
  4. A gene is haploinsufficient. A person is heterozygous for a loss-of-function variant in that gene. Will they be affected? Why?
  5. A variant in an African American patient is classified as VUS. The same variant is classified as "likely benign" in a database populated mostly by European patients. What is the most likely explanation?
  6. What is compound heterozygosity and why does it matter for recessive disease?

---

Where this takes you

You now understand what variants are and how they cause disease. But how do we find them in the first place? How do you go from a blood sample to a list of 12,000 variants?

That requires sequencing — the technology that turned genomics from a 13-year, $3-billion project into a $200 clinical test.

→ Module 3: How we read genomes — sequencing without the mystery

---

Primary sources:

  • Richards, S. et al. (2015). "Standards and guidelines for the interpretation of sequence variants." Genetics in Medicine, 17, 405–423. Open access.
  • Martin, A.R. et al. (2019). "Clinical use of current polygenic risk scores may exacerbate health disparities." Nature Genetics, 51, 584–591.
  • Hanahan, D. & Weinberg, R.A. (2011). "Hallmarks of Cancer: The Next Generation." Cell, 144, 646–674.