By Nathaniel Pearson
August 27, 2013
A talk at the Clinical Genome Conference resonated with some folks, who suggested sharing it. Crucially, that crowd included doctors, who have much to both teach and learn in the brave new world of genomic medicine. With them on hand, the day’s session loosely echoed a grand rounds, where the case was, soberingly, the tall order of making genomes widely useful in healthcare.
In such circles, a genomicist like me is a narrow specialist. Bigger speakers (general practitioners, so to speak) were lined up to cover the chronic, integrative needs on the conference bingo card: convening stakeholders; establishing standards for reporting and payment; managing big data…a wordsquall that looms over our field, easy to discuss but hard to fix.
So rather than tackle such broad challenges, my talk stayed bite-size. Building from insights on genome structure, function, and variation, it urged three small but concrete ways to help put genome-informed healthcare on firmer footing.
- Use different reference genomes to align a person’s raw data (pick reference(s) most like her/him) versus store her/his finished genome (as clear or potential differences from the human ancestral reference).
- Clinically classify genotypes, not variants.
- Filter a genome against other individuated genomes, not allele frequency tables.
Though these ideas aren’t new, they would break convention, so they need justifying. But even if you skip the explanations that follow, know that the proposals reflect long thought on how current convention, rooted in sparse data, will ultimately fail for millions of whole human genomes. Thus consider them early course tweaks that can save bigger tacks later, en route to genomically informed healthcare for all of us.
Small step I: The right reference genome(s)
A reference genome (we’ll just say reference) is a long string of letters used as a common template for comparing the genomes of closely related organisms, such as people. As an archetype, a reference often shortens and simplifies real genomes,1 to help read, write, and interpret them.
In teasing apart these tasks, note that today we use the same human reference for all three…and that it’s right for none of them. Below we’ll see why, and what we should do about it. But if you’re rushed, here’s the gist:
The current single reference is arbitrary and ethnocentric; inevitably misaligns most people’s raw data; and is poor for writing and interpreting genomes afterward, because it includes rare and risky variants, and muddles summary insights on data quality and evolution.
An alternative made of just common or putatively healthy variants would still be unreliable for aligning raw data, and as a foil for writing and interpreting genomes.
Instead, we should read your genome by aligning raw data to references most like you (we can usually guess which). We should then write all our genomes against the human ancestral reference — a solution that’s ethnically neutral, straightforwardly informative on data quality and evolution, and stabler than alternatives.
And we should give up on using any reference to proxy an idealized healthy genome. As later posts will detail, reliable health insight will instead require comparing your genome to the individuated whole genomes of many other people who, like each of us, get some diseases and not others.
Ok, now let’s walk through those reference tasks in detail, to better understand why we must do them differently.
To read your genome — that is, to make out the long eye chart of letters that form it — a modern sequencer streams zillions of DNA snippets, each copying a chromosome tract roughly at random. By comparing each snippet to a good reference, a computer can find where it best fits, much as we match jigsaw puzzle pieces to the picture on the box. As snippets pile up, the computer surveys what DNA letter(s) amass over each spot, to guess what letter(s) your chromosomes carry there.2
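The align-then-pile-up idea above can be sketched in a few lines. This is only a toy illustration, not how production aligners work: it assumes snippets contain no insertions or deletions, places each snippet at the single offset with the fewest mismatches, and calls one majority letter per site (ignoring that real chromosomes come in pairs). The names `best_offset` and `pileup_calls`, and the sequences, are made up for the example.

```python
from collections import Counter

def best_offset(snippet, reference):
    """Return the reference offset where the snippet has the fewest mismatches."""
    best, best_mismatches = 0, len(snippet) + 1
    for start in range(len(reference) - len(snippet) + 1):
        window = reference[start:start + len(snippet)]
        mismatches = sum(a != b for a, b in zip(snippet, window))
        if mismatches < best_mismatches:
            best, best_mismatches = start, mismatches
    return best

def pileup_calls(snippets, reference):
    """Pile aligned snippets over the reference; call the majority letter per site."""
    piles = [Counter() for _ in reference]
    for snippet in snippets:
        start = best_offset(snippet, reference)
        for i, base in enumerate(snippet):
            piles[start + i][base] += 1
    # Sites that no snippet covers are reported as "N" (unknown).
    return "".join(p.most_common(1)[0][0] if p else "N" for p in piles)

# Toy data: the sampled genome differs from the reference at one site (G -> A).
reference = "ACGTACGGTCAT"
snippets = ["ACGTAC", "GTACGA", "CGATCA", "GATCAT", "TACGAT"]
print(pileup_calls(snippets, reference))
```

Here every snippet that covers the eighth site carries A rather than the reference G, so the pile-up reports the sample’s own letter there even though each snippet was placed using the reference.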
Conventionally, we’ve taken a one-size-fits-all approach to this task of aligning snippets, using the same reference, called Hg# (where today # = 19), to scaffold everyone’s genomes. But Hg# wasn’t carved in stone. Instead, it’s quilted from several real people’s genomes that were read by costly, reference-free methods. And the haphazardly picked people who contributed to it have their own ancestry, which gives Hg# their genetic quirks.
As a result, some human genomes are more like Hg# than others. And if my genome resembles it more than yours does, my snippets will, on average, align more reliably. Conversely, because big populations tend to be genetically diverse, Hg# — like any single option — inevitably misaligns raw data from most people’s genomes, in ways both big (mutual gaps and rearrangements wreak havoc) and small (clustered small differences leave good snippets unaligned).
In the end, this means that we can best read your genome today by first aligning to the already available genome(s) most like it.3
Happily, skimming your genome — or even just looking at you — strongly hints whose genome(s) might work best.4
How well we can play reference sommelier depends on what options are on hand (more and more, starting with synthetic references that proxy what’s common in a particular part of the world), and how saliently mixed your recent ancestry is. But if needed we can try multiple references, and see which work best for which snippets.
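As a rough sketch of trying several references and seeing which works best for which snippets, the toy code below scores each snippet against each candidate and keeps the closest one. The reference names (`ref_A`, `ref_B`), the sequences, and the mismatch-count scoring are all assumptions for illustration; real pipelines use far richer alignment scores.

```python
def mismatches_at_best_fit(snippet, reference):
    """Fewest mismatches the snippet shows at any ungapped placement."""
    n = len(snippet)
    return min(
        sum(a != b for a, b in zip(snippet, reference[i:i + n]))
        for i in range(len(reference) - n + 1)
    )

def pick_reference(snippets, references):
    """Map each snippet to the name of the reference where it fits best."""
    return {
        s: min(references, key=lambda name: mismatches_at_best_fit(s, references[name]))
        for s in snippets
    }

# Two hypothetical alignment references that differ at a couple of sites.
references = {
    "ref_A": "ACGTACGGTCATTGCA",
    "ref_B": "ACGTACGATCATAGCA",
}
snippets = ["TACGAT", "CATTGC"]
print(pick_reference(snippets, references))
```

In this toy case the first snippet matches `ref_B` exactly and the second matches `ref_A` exactly, which is the situation a person of mixed recent ancestry might present: different references serve different stretches of the same genome.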
And that raises two deeper points. First, for some genome segments, such as a stunningly diverse and health-relevant stretch of chromosome 6, it’s hard to predict what your genome looks like regardless of where your forebears came from. For such segments, it makes sense to always align your snippets to many reference options.5
Doing so takes a few more electrons, but usefully sharpens the resulting picture of your genome.
Second, aligning your snippets to even one whole real genome would itself be like aligning them to two versions of a conventional reference (with its one copy of each chromosome that’s paired in real genomes). Smartly, that could fully leverage new algorithms that track everywhere a snippet decently fits during alignment, rather than just picking one spot (often by tossup). And that, in turn, would let us read your genome more finely, without — yet — needing the compact simplicity of a conventional reference like Hg#.
Which brings us to the next use of a reference…
After we read your genome in detail, a reference helps write it. Namely, because copies of a given human chromosome are all grossly alike, we can thriftily store yours by just noting where one or both mismatch a simple (single-copy) reference, or were read too poorly to tell.
Everywhere else — typically, >95% of the currently sequenceable parts of human chromosomes — we can assume that your copies match that reference. And because many of your poorly read sites will themselves clump in compressible tracts, we can shrink your genome >>20-fold in the end. That saves memory, of course, but also helps us query it — most usefully, by comparing your DNA (plus your phenotypes, ideally) to others’.
But there’s a catch. Because mutation anywhere on a chromosome can make it longer or shorter than some other copy, genomes can best be compared if stored as differences from the same reference, so their mapping coordinates match. That way, like sailors agreeing to track longitude from Greenwich, we can neatly record findings like ‘One of your chromosome 7s shows five more bases (ACGTA) than mine at reference site 1000; but one of mine shows three fewer bases at reference sites 2000-2002’…
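To make the storage idea concrete, here is a minimal sketch of writing a chromosome copy as differences from a shared reference and rebuilding it later. The record layout `(position, ref_allele, alt_allele)` with 0-based positions is an assumption chosen for the example, loosely in the spirit of common variant-file conventions; the sequences are toy data, and poorly read tracts are left out for brevity.

```python
def apply_differences(reference, differences):
    """Rebuild one chromosome copy from a reference plus difference records.

    Each record is (position, ref_allele, alt_allele): 0-based position,
    the reference letters being replaced, and the letters that replace them.
    Records must not overlap.
    """
    pieces, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(differences):
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"record at {pos} does not match the reference")
        pieces.append(reference[cursor:pos])   # unchanged stretch before the record
        pieces.append(alt_allele)              # the differing letters
        cursor = pos + len(ref_allele)
    pieces.append(reference[cursor:])          # unchanged tail
    return "".join(pieces)

reference = "GATTACAGATTACA"
yours = [(3, "T", "TACGTA")]   # five extra bases (ACGTA) after reference site 3
mine = [(7, "GATT", "G")]      # three fewer bases, at reference sites 8-10

print(apply_differences(reference, yours))
print(apply_differences(reference, mine))
```

Because both lists use the same reference coordinates, the two copies can be compared record by record without first rebuilding either full sequence.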
Note the dilemma here: to read genomes, we should align their snippets to various most-appropriate real reference(s); but to compare them, we should write them as differences from the same simple reference.6
Bottom line, we need task-specific references.
But that still means picking one best reference for writing genomes. Given that so much work has gone into Hg#, we might ask whether it’s the right one. Which leads us to the third use of references…
After shrinking your genome to a list of differences from a reference, we’d like to understand that list — what it says about how sequencing went and, more importantly, about you. We might even hope to use the reference to proxy a healthy genome, so that anything worrisome in your genome stands out from it.
Alas Hg# makes a poor interpretive foil for real genome data, starting with quality control: because Hg# comes from a few modern people, it’s poor not just for aligning, but also for writing, where it can conflate statistical signatures of lab-bench problems (sample contamination, chemistry failure, &c.) with those of ancestry. QC that first compares heterozygosity of particular genome segments, rather than just counting reference sites called with any mismatch, will help there (an issue for another day…), but the problems with the current reference go deeper.
In particular, Hg# also includes many variants already implicated in diseases, which means that it won’t always flag your own worrisome DNA spellings,7 and that it troublesomely differs from some single-gene references familiar to clinical geneticists. Moreover, Hg# includes many other variants that, while not yet well studied, are suspiciously rare enough to be harmful too.
Given those shortcomings, many have suggested replacing Hg#’s rare and/or known risky variants with common and/or healthy alternatives, ostensibly yielding a new reference that reliably proxies a healthy, normal genome.
Alas, that won’t work, for two reasons. What’s common varies. And what’s healthy depends.
Informative beats (un)healthy
At first glance, one of your DNA spelling variants may be rare enough on earth overall to intrigue us — but turn out to be boringly common among millions of mostly healthy people in some small patch of the planet. More profoundly, the commonest variant at a genomic site today or in five years may not be the commonest one next year or in ten years. That’s evolution — and it means that a common-only reference is inherently unstable.
On the health side, meanwhile, many variants aren’t simply good or bad. Their effects depend on how many copies you have (0, 1, or 2), what disease we ask about, and what other variants lurk in your genome.
You may know a few such twists already. One or two copies of T here help avoid malaria and high cholesterol — but two copies leave you with crippling anemia. One copy of A over there can drive breast cancer, but mainly if you also lack a working copy of the SRY gene (which, on the flipside, helps you avoid testicular cancer, among other diseases…). And so forth.
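One way to picture classifying genotypes rather than variants is a lookup keyed by how many copies of an allele a person carries, instead of by the allele alone. The sketch below is purely illustrative: the site name `site_1`, the table, and the notes (which merely paraphrase the first twist above) are placeholders, not clinical data.

```python
from collections import Counter

# Hypothetical placeholder table keyed by (site, allele, copies carried).
# The notes paraphrase the illustration in the text; they are not clinical data.
GENOTYPE_NOTES = {
    ("site_1", "T", 0): "no effect noted",
    ("site_1", "T", 1): "helps avoid malaria",
    ("site_1", "T", 2): "helps avoid malaria; crippling anemia",
}

def classify_genotype(site, allele, genotype):
    """Look up a note for the number of copies of `allele` in a two-copy genotype."""
    copies = Counter(genotype)[allele]
    return GENOTYPE_NOTES.get((site, allele, copies), "unclassified")

print(classify_genotype("site_1", "T", ("C", "T")))
print(classify_genotype("site_1", "T", ("T", "T")))
```

A table keyed by the variant alone would have to give T a single label, which is exactly what the one-copy versus two-copy contrast rules out.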
Data from billions of us will unfurl more astounding complexity, where variants throughout your genome — some inevitably present in any reference we use — interact in surprising ways with each other, and with habits and other factors, to favor some diseases and disfavor others.8
Other posts will further explore how this hard truth should alter our approach to genomic healthcare. Here, it simply dooms any hope of using any reference to reliably proxy what’s healthy.
And more deeply, using a reference like Hg# as an interpretive yardstick also obscures how genomes change and, by extension, how various kinds of changes tend to affect health in the first place. Hg# can’t, for example, tell us whether a so-called deletion in your genome (where it’s missing a tract found in Hg#) really reflects a mutation that deleted bases in you or your forebear, or instead reflects an insertion of bases in someone who contributed to Hg#.
As such, because a given letter in a reference like Hg# could itself reflect a past mutation, writing everyone’s genomes as differences from Hg# makes statistical questions like ‘How often does the snippet CG mutate to TG? And how well does that TG survive, over generations, if it changed a protein’s arginine to cysteine?’ trickier than they should be.
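The ambiguity described above can be shown with a tiny sketch that labels an observed allele by comparing it with a chosen baseline. The function name `polarize` and the example site are hypothetical; the point is only that the same observation gets a different label depending on whether the baseline is a modern reference or the inferred ancestral state.

```python
def polarize(baseline, observed):
    """Describe an observed allele as a change relative to a baseline allele."""
    if observed == baseline:
        return "same as baseline"
    if len(observed) > len(baseline):
        return "insertion"
    if len(observed) < len(baseline):
        return "deletion"
    return "substitution"

# Hypothetical site: the modern reference carries "ACGT", a sample carries "A",
# and the inferred ancestral state is "A".
modern_reference, sample, ancestral = "ACGT", "A", "A"

print(polarize(modern_reference, sample))     # deletion (an artifact of the baseline)
print(polarize(ancestral, sample))            # same as baseline
print(polarize(ancestral, modern_reference))  # insertion (the actual mutation)
```

With alleles polarized against the ancestral state, questions such as how often CG mutates to TG reduce to counting records, rather than first untangling which side of each difference actually changed.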
Such questions matter. They can unlock basic physiology (How do mutations happen? Why do tumors correct them so poorly?); hint how a new variant may affect health (Does changing active-site arginine to cysteine often make an enzyme fail?); and clarify how variants interact with each other, and with habits, to cause disease (Why do some genetic variants, like APOE4, make us sick but leave chimpanzees healthy?).
Those big questions require the big data inside us. Even if no more than a handful of your DNA spellings alter your own healthcare, the rest of them, pooled with similar data from all of us, can shed light on many diseases to greatly refine care for our grandkids.
But using a conventional reference like Hg# needlessly hinders that effort. So while we must abandon the idea of any reference reliably proxying a healthy genome, can we at least find a sensible reference to write and compare the coming flood of genomes, to catalyze those deeper insights?
An ancestral reference
We can. The sensible yardstick for writing your genome is the human ancestral reference (HAR) — that is, a single-copy genome comprising, at each chromosomal site, the DNA letter carried by the last common ancestor of all people for that site.
In picturing the HAR, note two things. First, Suganthi Balasubramanian and colleagues have already built (and used) it, nearly site for site,9 by comparing our genomes to those of other great apes. Second, two genomic sites can trace to different last common ancestors. That’s because, when eggs and sperm are made, chromosomes pair up, swap segments, and move into different cells. Each copy of a chromosome thus quilts together pieces of earlier copies; so everyone’s last common ancestor for site 1000 may not be our last common ancestor for site 1001 (they may have even lived eons apart). Which also means it’s implausible that any person ever carried the whole HAR.10
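As a minimal sketch of inferring an ancestral letter by comparison with other great apes, the rule below calls a site only when all outgroup species agree and their shared letter is also seen among people. This simple agreement rule is an assumption made for illustration, not the published method, which may weigh evidence quite differently; the species names and letters are toy inputs.

```python
def infer_ancestral(human_alleles, outgroup_alleles):
    """Return the letter shared by all outgroups and seen in humans, else None.

    human_alleles: set of letters observed among people at a site.
    outgroup_alleles: dict mapping species name to its letter at that site.
    """
    outgroup_states = set(outgroup_alleles.values())
    if len(outgroup_states) != 1:
        return None                      # outgroups disagree: leave unresolved
    (state,) = outgroup_states
    return state if state in human_alleles else None

print(infer_ancestral({"C", "T"}, {"chimpanzee": "C", "gorilla": "C"}))
print(infer_ancestral({"C", "T"}, {"chimpanzee": "C", "gorilla": "T"}))
```

The second call returns `None`, mirroring the sites where variation extends into other apes and the ancestral state cannot be reliably guessed.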
Among reference options for writing and comparing our genomes, the HAR uniquely combines several appealing features:
- It’s neutral. As noted, no one ever carried the whole HAR. And because the mutations that distinguish our genomes from it have struck roughly randomly among our ancestors, your genome resembles it about as much as mine does.11
In this important sense, the HAR belongs to none of us, and to all of us. Being roughly equidistant from everyone, it offers a uniform, non-ethnocentric baseline for assessing sequencing quality, and for reporting what’s genetically distinctive about you.
- It’s stable. The HAR actually looks a lot like a common-variants-only reference, because nearly all ancestral variants are common. But while a common-only reference would in principle need many edits each year to stay perfectly accurate, the HAR would need just one or two (as atypically rare ancestral variants go extinct).12
Such editing isn’t urgent, because few variants with allele frequency near 50% are functionally intriguing enough, or surveyed precisely enough in the population, to day-trade anyway. But that just makes it even smarter to build a reference on the stable, reliably inferrable, and meaningful criterion that a variant be ancestral, rather than worry whether its allele frequency fell to 49.9%. That way, we get summary insights even from otherwise boring variants — and a low-maintenance reference to boot.
- It’s compact but comprehensive. Like conventional references, the HAR is a simple single-copy (haploid) genome. Real genomes, compressed against it, would yield files consistently intermediate in size between the biggest and smallest files compressed against Hg#.13
Nonetheless, because new chunks of DNA are usually copied from chunks elsewhere in the same genome, the HAR includes source DNA for nearly all chunks of real human genomes (missing only those recently copied from viruses or bacteria, or other oddities). Other reference options tend to be less comprehensive on these counts, which poses an ongoing dilemma of when to add a segmental copy (to make them more thorough), versus omit it (to keep them compact).14
That dilemma would still apply, but the HAR offers a framework for handling such segments that we choose to include in the extended HAR that Balasubramanian et al. proposed. For a newly arisen extra segment that some but not all people have, variation among such copies could in turn be mapped to common coordinates in the inferred earliest (nearest to universally ancestral) version of the new copy.
- It’s directly informative. Most importantly, the HAR is the only reference option that directly shows how human genomes change. As a foil for writing all our genomes, it would thus most quickly reveal summary patterns of change that in turn shed light on basic biology and health.
Concretely, if shortening one bend in a protein makes people sick, but lengthening it — or shortening another bend — doesn’t, the HAR would let clinical geneticists reliably spot this faster than other references would.
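The stability point in the list above can be illustrated with a toy calculation. It assumes a single two-allele site whose derived allele hovers near 50% frequency across successive surveys; the frequencies are invented for the example, and the function names are hypothetical.

```python
def common_only_letter(ancestral, derived, derived_freq):
    """Letter a common-only reference would carry at a two-allele site."""
    return derived if derived_freq > 0.5 else ancestral

def count_edits(letters):
    """Number of times a reference letter changes between consecutive surveys."""
    return sum(a != b for a, b in zip(letters, letters[1:]))

# Invented derived-allele frequencies from five successive surveys.
derived_freqs = [0.48, 0.49, 0.51, 0.50, 0.52]

common_only = [common_only_letter("C", "T", f) for f in derived_freqs]
ancestral_ref = ["C"] * len(derived_freqs)   # the ancestral letter never changes

print(common_only, count_edits(common_only))
print(ancestral_ref, count_edits(ancestral_ref))
```

In this made-up series the common-only reference would need three edits at the site, while the ancestral reference needs none, because its criterion does not depend on where the frequency happens to sit in a given survey.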
The benefits of an ancestral reference for making sense of genomes, both individually and together, have long been starkly clear for geneticists studying the first fully sequenced human chromosome: the mitochondrial genome (mtDNA).
Starting in 1981, we used the first sequence of a person’s mtDNA as a reference. That sequence forms a leaf on the simple evolutionary tree that binds all our mtDNA versions. And because each of us gets only our mom’s version of this short but gene-rich chromosome, with no backup from dad, that tree’s branches are key foci of health research.
But using a modern person’s mtDNA as a reference meant treating a leaf as if it were the treetrunk. Like a concave mirror, this flipped and warped our view of the tree, prompting epicycle-like contortions to figure out where your leaf was, and how your branch may or may not have mutated in telling ways.
In 2012, researchers cut that Gordian knot, proposing the human ancestral mtDNA as a reference for writing real genomes. That new reference lets you easily a) find your mtDNA leaf, and b) see how DNA has changed throughout the tree, to better understand key biological processes.
Having learned the hard way with mtDNA, we needn’t wait 31 years for our other chromosomes. In the end, by using multiple references to align raw data, and adopting the HAR to write and compare our finished genomes, we can best read, write, and learn from the millions of human genomes soon to be sequenced.
Getting smarter with references
So where are we, as a community, on human reference genomes?
On the practical side, we’re accumulating diverse, ever better sequenced human genomes that can serve well as alignment references (and, as a bonus, help benchmark new sequencing methods). And we’re getting better genomes from elsewhere in the great ape family tree, to refine the HAR.
Moreover, today’s de facto standard, Hg#, continues to improve via thoughtful work by Deanna Church’s team and others. Beyond fixing errors and filling in previously missing segments, the pending Hg20 version will include multiple versions of more segments, in part to better align raw data.
That’s a sensible stopgap, until more folks start picking from multiple alignment references from the start. But adding alternate versions of more segments to Hg# requires ongoing arbitrary choices, slows the task of writing finished genomes, and tends to statistically weaken comparisons of many genomes. The latter jobs are really better served by writing genomes against the HAR.
Communal habits like using Hg# for all human reference needs are hard to break, even for open-minded scientists (and maybe more so in famously stubborn medicine). But given the clear flaws in our current approach to reference genomes, it’s likely better to break those bad habits now than let them entrench further, as we start sequencing patients’ genomes by the thousands (and more).
Making all those genomes useful in healthcare, for us and future generations, will mean reading them well; writing them efficiently; and, as coming posts will explore further, interpreting them wisely.
All these goals rest on the bedrock of reference genomes. Let’s get them right.
Nathaniel Pearson is Principal Genome Scientist at Ingenuity Systems. Previously he served as senior director of science and research at Knome. He blogs at genomena.com. This piece was also posted on his blog on August 26.
Today’s ~2.9 billion-letter human reference, for example, comprises just one version of each of the distinct-looking molecules (chromosomes 1-22, X, Y, and M) in the >6.5 billion-letter genome of a man’s skin cell. That cell’s genome comprises two copies of most such chromosomes — and those copies, in turn, differ in chemical makeup (base sequence), and include tracts that have never been seen (or successfully sequenced), so are simply missing from the reference.
Importantly, many snippets from your genome differ from even their best-matching parts of the puzzlebox picture (otherwise, why bother sequencing?). But the reference template still helps piece them together faster than we otherwise could. And by piling many snippets over each site, we can tune out errors from cooking finicky chemicals under tiny image sensors — a bit like how astronomers, at the other end of the spatial scale, distinguish lasting light sources from noise by overlaying many pictures of the same part of the sky.
That point was moot in 2003, when we had to use the hardwon sequence that became the current reference. But since then, we’re bootstrapping our way to good sequences of many human genomes from around the world — a pool that we should tap to better align newly sequenced genomes, as some folks have already shown.
Ideally, we’d use parents’ genomes to align those of their kids…but when sequencing is common enough to make that practical, sequencers will likely make longer snippets that are easier to piece together from the start anyway, even without aligning to a reference.
Helpfully, Hg# itself includes several options for some such segments — and those who built and refine it plan to add some more.
Note that even though we write them as differences from a simple reference, which has just one copy of each chromosome, we can still keep track of which spellings go together on each copy of your chromosomes (if our sequencing method was good enough to tell in the first place).
Especially if whoever compressed your genome didn’t bother noting where your genome was too poorly sequenced to know what it carries — a corner that geneticists too often cut.
Such insights stand to turn much of the noise that we currently sweep under the rug of partial penetrance into far better understood signal — think, for example, about how genetic insight turned the apparent random noise of why a baby was born female or male into causal signal tracing largely to the sex chromosomes.
For the remaining sites, we can’t reliably guess what variant our last common ancestor carried, because the state of variation we see among our copies extends to other great apes, suggesting that such variation has lasted too long to reliably unravel. In extreme cases, like the sex chromosomes, such lasting variation is already enshrined in the current reference (Hg# has one X sequence and one Y sequence, despite the fact that not everyone has the latter).
Even the last common ancestors who contributed to the HAR had variants in their own genomes that aren’t in it.
Planetwide, we have no idea whose genome happens to differ from it most, though that person — ironically, in some sense the most evolved (geneticists would say derived) of us — is almost certainly very sick, thanks to gross genetic changes…
Many thanks to Graham Coop and Justin Fay for helping think through the relevant numbers here.
Average compression is the main measure that could instead be optimized by a common-only reference. But the HAR has several substantive advantages over that less stable and informative option.
Note, btw, that this question matters most if we’re using a reference to align snippets — which we’re not proposing here. But we do need to map each of the alignment references themselves to the writing/comparing reference, which is where it helps to make sure the latter includes source DNA for the segments that those alignment references may have extra or fewer copies of.