New Method Aims to Sequence the Genomes of the Most Elusive Microbes

(CORRECTED 6/10/14: An earlier version of this article stated that Christopher Beitel's paper in PeerJ was published around the same time as the G3 paper that is the main focus of this story. In fact, a preprint of the PeerJ paper was available online as early as this February; the peer-reviewed version appeared in May. The text has been changed to reflect this.)

By Aaron Krol

June 9, 2014 | You may not have heard of Pelagibacter ubique, a tiny, unassuming bacterium that floats along the ocean surface, but it is very likely the most abundant living thing on Earth. P. ubique is part of the SAR11 clade, a group that together makes up as much as half of all the bacteria in the ocean. There’s a lot to love about P. ubique, including its fascinating genome, which is the smallest of any known non-parasitic organism and saves space by having no apparent “junk DNA” and almost no gaps between genes.

Perhaps the most astonishing fact about P. ubique, however, is how hard it was to find. As late as 1990, no one had any inkling that the SAR11 clade existed at all, despite its living — teeming, even — all over the world, right on the ocean surface. The group was only discovered when researchers started sequencing the DNA found drifting in seawater, and found a particular gene, which coded for an unknown type of ribosomal RNA, in such staggering numbers that a first guess at the worldwide population of SAR11 placed it in the tens of billions of billions of billions.

How did these hyper-abundant fellow-travelers go unnoticed for so long? P. ubique is one of the many bacteria that microbiologists refer to as “unculturable,” meaning it does not like to grow in the lab. This makes it essentially invisible to us, since we typically need to grow colonies of bacteria to learn about them.

In fact, we have a pretty skewed view of the microbes around us. We’ve gotten to know a few of them, like E. coli, very well — partly because they’re important to us, but mainly because they have no trouble living in petri dishes. But in the complex ecosystems that surround us, these culturable critters are the tiny minority. Somewhere around 99% of microbes refuse to be taken out of their natural habitats.

Most of the unculturable species will have to be discovered the way P. ubique was: DNA-first. In an age of booming interest in microbial communities, especially those that live in and on the human body, scientists are sequencing mixed populations of bacteria, fungi, and other microbes faster than ever. “The most straightforward way is to essentially just grind up the whole mixture, and make DNA preps for sequencing from the whole population,” says Maitreya Dunham, a professor in the Department of Genome Sciences at the University of Washington, whose lab specializes in the genetics of yeast. Sequencing this way is relatively fast and easy, and it will give you a “metagenome,” a sort of mish-mash of all the genetic fragments in your sample that could be stitched together.

“The problem,” Dunham continues, “is you don’t really know which sequences came from the same species. You can frequently cluster them into categories, and say these are likely from this branch of the evolutionary tree, and this type of bacteria, but it can be difficult to tell exactly. And of course it’s difficult to check your work, because you can’t grow the individual components.” The only way to resolve your sample to the level of particular species is if you have reference genomes for those species at hand, to map your reads against. If you’re interested in the unknown species whose DNA makes up the more mysterious areas of the metagenome, you’re out of luck.

This May, however, Dunham and several of her colleagues published a paper in G3, the open access journal of the Genetics Society of America, describing a new method for sequencing microbe communities. In two test runs in the lab, their method was able to split mixed samples into their constituent species, and recreate those species’ genomes, without any prior knowledge of what was in the batch. If it proves workable in natural samples, this technique could be our first chance to get to know unculturable microbes at the species level on a large scale.

Cell Mates

The new work from the University of Washington is based on Hi-C, a genetics technique that originally had nothing to do with microbe communities. Hi-C was invented as a way to find out which regions of a particular genome tend to interact with each other. (The name is an in-joke among geneticists that would take an obnoxious amount of time to explain*.) It starts by treating cells with formaldehyde, which causes DNA to form tight linkages at sites of heavy interaction. With a series of chemical treatments, you can make the DNA bond around the cross-linked sites, creating fragments that contain DNA from two different areas of the genome. Then you chop out those fragments and sequence them, which shows you which regions were caught together during the formaldehyde treatment. Using Hi-C, you might learn that two genes on separate chromosomes actually have an important relationship with one another.

Although Hi-C is a finicky procedure, it’s quickly become popular, and people are finding new uses for it all the time. At the University of Washington, a graduate student named Josh Burton had hit on a way to improve genome assemblies with Hi-C. Since most cross-linking happens at sites that are close to each other on the same chromosome, this short-range linking information can help stitch DNA fragments together in the right order when trying to build a genome from scratch.

But Burton was also thinking of other creative uses for Hi-C. “I’m on Josh’s committee,” says Dunham, “and at one of his committee meetings, he said, ‘What if we did this with a metagenomic sample?’ He and my post-doc [Ivan Liachko], it turned out, had been talking about this idea.”

It’s a clever twist on the method. If you perform Hi-C on a whole community of microbes, the formaldehyde is added before you break up the members’ cells to get at their DNA. That means the cross-links should only form between regions inside the same cell — so any linked fragments must belong to the same species. That’s valuable information, because two fragments linked by Hi-C will often come from disconnected areas of the metagenome. The Hi-C links drag together pieces of the metagenome until nearly every bit of sequence has been pulled into one of a few clusters, each representing a unique species. Burton wrote a computational tool called MetaPhase to perform this analysis.
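To make the clustering idea concrete, here is a minimal Python sketch. It is not MetaPhase's actual code: the function name cluster_contigs, the link_counts dictionary, and the toy contig names are all hypothetical, and the sketch assumes you tell it how many clusters to stop at. It simply keeps merging the two groups of fragments that share the most Hi-C links per pair of fragments.

```python
from itertools import combinations


def cluster_contigs(contigs, link_counts, n_clusters):
    """Greedily merge contigs into clusters using shared Hi-C links.

    contigs:     list of contig names (hypothetical labels)
    link_counts: dict mapping frozenset({a, b}) to the number of Hi-C
                 read pairs joining contigs a and b
    n_clusters:  stop merging once this many clusters remain (assumed known)
    """
    clusters = [{c} for c in contigs]

    def avg_links(x, y):
        # Average number of links per contig pair between two clusters.
        total = sum(link_counts.get(frozenset((a, b)), 0) for a in x for b in y)
        return total / (len(x) * len(y))

    while len(clusters) > n_clusters:
        # Find the two clusters with the strongest average linkage.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_links(clusters[p[0]], clusters[p[1]]),
        )
        if avg_links(clusters[i], clusters[j]) == 0:
            break  # nothing left that is linked at all
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Toy data: three fragments from one made-up species, two from another,
# plus a single spurious link between them.
contigs = ["A1", "A2", "A3", "B1", "B2"]
links = {
    frozenset(("A1", "A2")): 40,
    frozenset(("A2", "A3")): 35,
    frozenset(("A1", "A3")): 20,
    frozenset(("B1", "B2")): 50,
    frozenset(("A3", "B1")): 1,
}
print(cluster_contigs(contigs, links, n_clusters=2))
```

Averaging the links over the number of fragment pairs is one simple way to keep a single stray cross-species link from gluing two large clusters together; a real tool would need to handle far noisier data and to estimate the number of species on its own.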

Figure: MetaPhase clusters. Results from MetaPhase, showing clusters of genetic fragments from 18 different species of bacteria, yeasts, and archaea. Reproduced from G3 with permission of the authors.

To test the idea, the team — which included lead authors Burton and Liachko, and corresponding authors Dunham and Jay Shendure — threw 13 species of yeast together in culture and set about sequencing them. “They were all the species that I currently had in my freezer, from our other work,” remembers Dunham. Yeasts were a useful test case because their genomes are organized differently from bacteria and other prokaryotes. With multiple chromosomes, a yeast genome is impossible to put together based on overlapping fragments alone: to cluster the right chromosomes together, you need the extra Hi-C information on whether they came from the same cell.

Sure enough, MetaPhase managed to divide the metagenome into 12 clusters, and after checking those clusters against the reference genomes for the various yeast species in the batch, the team found that over 99% of sequence was assigned to the right species. (One species, which was mixed in at a much lower abundance to see if the method could capture rare populations, was not picked up by the sequencer strongly enough to be reconstructed.)
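For readers wondering how a figure like "over 99% of sequence" might be tallied, the following Python sketch shows one plausible way to do it. This is an assumption, not the paper's stated procedure: the function assignment_accuracy and its three input dictionaries are hypothetical. Each cluster is labeled with whichever species contributes the most bases to it, and the score is the fraction of all bases that sit in a cluster carrying their own species' label.

```python
from collections import defaultdict


def assignment_accuracy(contig_lengths, cluster_of, species_of):
    """Fraction of bases placed in a cluster dominated by their true species.

    contig_lengths: dict of contig name -> length in bases
    cluster_of:     dict of contig name -> cluster id chosen by the method
    species_of:     dict of contig name -> true species, as determined by
                    checking the contig against reference genomes
    """
    # bases[cluster][species] = total bases of that species in that cluster
    bases = defaultdict(lambda: defaultdict(int))
    for contig, length in contig_lengths.items():
        bases[cluster_of[contig]][species_of[contig]] += length

    # The majority species in each cluster counts as "correct".
    correct = sum(max(per_species.values()) for per_species in bases.values())
    return correct / sum(contig_lengths.values())


# Toy example: contig c2 (100 bases) lands in the wrong cluster.
lengths = {"c1": 900, "c2": 100, "c3": 1000}
clusters = {"c1": 0, "c2": 0, "c3": 1}
truth = {"c1": "X", "c2": "Y", "c3": "Y"}
print(assignment_accuracy(lengths, clusters, truth))
```

Note that this simple score does not penalize a species whose genome has been split across two clusters, so it is best read alongside a count of how many clusters were produced.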

“Once we showed that it worked on that sample, we got super excited,” says Dunham. “Ivan went and searched the microbiology department, and all the neighboring labs, and asked them for any bugs that they had. That’s how we came up with the second synthetic mixture, which was a bunch of fungal species, but also a bunch of bacteria and an archaea species – which was great, because we did want to demonstrate that it would work on bacteria as well.” This second mix had nine species of bacteria, eight yeasts, and one archaeon. MetaPhase clustered all 18 species correctly, and for the bacteria and the archaeon, the accuracy rate was nearly 99.9%.

Dunham is hopeful that this high level of accuracy will stand up when looking at bacteria in the real world — potentially even different strains of the same species. “What’s fascinating about bacteria is that, even within something that can have the same species name, there’s huge diversity,” she says. “Calling something E. coli is not that useful unless you can say, is it the good E. coli or the bad E. coli? I think you would be able to tell such things apart.”

Mysterious Genomes

One of the cool things about the MetaPhase method is all the extra information it provides about a microbial community. After sorting the pieces of the metagenome into the correct species bins, it’s possible to build an entire draft genome for each species, in part by using the Hi-C data on which sequences are close together on the same chromosome. That means scientists should be able, by applying this method to real-world samples, to recreate the whole genome of a species that no one has ever seen or isolated. Such a species would be known only by its complete DNA code.

One can also get a rough idea, looking at how many sequencing reads were assigned to each cluster, of how abundant any given species is inside the community. The team also produced a standard Hi-C analysis for each species in the two batches. “We incidentally did the first Hi-C of an archaeal species, totally by accident, because it was in the mixture,” says Dunham. “We just included it in one of our clusters and didn’t do any analysis of it. I’m really hoping that some expert in archaeal genomes will find it and try to use it for other purposes.”
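The abundance estimate mentioned above comes down to simple arithmetic, sketched here in Python. The function relative_abundance and its inputs are hypothetical, and dividing by each cluster's total sequence length is an assumption added for this sketch: without it, a species with a big genome would look more common than it is, simply because it offers more DNA to be sequenced.

```python
def relative_abundance(read_counts, cluster_lengths):
    """Estimate each cluster's share of the community from read counts.

    read_counts:     dict of cluster name -> number of reads assigned to it
    cluster_lengths: dict of cluster name -> total assembled bases in it
    Returns a dict of cluster name -> fraction of the community (sums to 1).
    """
    # Reads per base is a rough stand-in for how many genome copies were present.
    coverage = {
        name: read_counts[name] / cluster_lengths[name] for name in read_counts
    }
    total = sum(coverage.values())
    return {name: cov / total for name, cov in coverage.items()}


# Toy example: cluster "a" has three times the reads of "b",
# but also three times the genome size.
reads = {"a": 300, "b": 100}
sizes = {"a": 3_000_000, "b": 1_000_000}
print(relative_abundance(reads, sizes))
```

In this toy case the two clusters come out equally abundant, even though one drew three times as many reads. As the article says, any such number is only a rough idea, since sequencing does not capture every species equally well.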

“The next obvious step is to apply it to a real sample, so we’re attempting that right now,” she adds. “Our mixtures had less than twenty species in them, and real metagenomic samples can have thousands of species, depending on where they come from, so we’re collecting a range of samples.” This will be the real test of the team’s method. It will be much harder to verify whether genomes from natural microbe communities are accurate, since many will belong to species with no premade reference genomes to check against.

Dunham is hoping that experts from all kinds of relevant fields (marine microbiology, soil microbiology, the human microbiome, and yeast research like her own lab's) will get on board and start following up with their own MetaPhase analyses and confirming experiments. The MetaPhase code is freely available on GitHub for any lab to take advantage of. Meanwhile, another team from UC Davis, led by Christopher Beitel, published a very similar method in the open access journal PeerJ this February.

A lot of existing research stands to benefit from MetaPhase’s ability to sort a mixed metagenome into separate species. One common practice now is to create metagenomes of all the bacteria and fungi that live in the human gut, which can tell researchers what broad groups of microbes are living inside us. Even in this very intimate environment, however, the vast majority of species are unculturable and undiscovered.

“People who already have samples in their freezers, which they’ve done the first layer of metagenomics on, might have sample left over that they could do the Hi-C method on,” says Dunham. Making a second pass with MetaPhase could finally bring the fuzzy visions of past experiments into focus.

There’s a world of alien organisms all around us, and while most of them are shy, their DNA can’t be hidden away. It’s time we got acquainted.

* Oh, very well. The first method for looking at chromosome interactions, which was much weaker than Hi-C because you had to know in advance what regions of the genome you wanted to check, was called “chromosome conformation capture,” or 3C for short. Over the years, geneticists kept improving on 3C, first with “circular chromosome conformation capture” (4C), and later with “carbon-copy chromosome conformation capture” (5C). After that point it became too unwieldy to keep adding more C words to the names, so when another new technique was invented in 2009, the creators called it Hi-C, like the fruit punch.