January 10, 2012 | From familiar but repeat-laden plant species to obscure vertebrates, more and more genomes are being sequenced that require de novo assembly without alignment to a reference sequence. “Every genome has its own story in terms of repeats,” says Ian Korf, associate director of bioinformatics at the University of California Davis Genome Center.
Korf is one of the principal organizers of a genome assembly challenge known as the Assemblathon—a competition to identify best practices in the de novo assembly of complex plant and animal genomes. Results of the first phase of the Assemblathon were recently published in Genome Research.
Korf discussed some of the Assemblathon results and, more broadly, the inherent challenges in genome assembly in a recent webinar hosted by the community forum NGS Leaders. “Sequence analysis starts after genome assembly—you can’t do much beforehand... Every genome is a complex genome—even the simpler ones are pretty complex. There’s no easy genome,” says Korf.
The Assemblathon grew out of the G10K project, an effort to sequence 10,000 vertebrate genomes. To sequence and assemble that many genomes, it is crucial to know which sequencing and assembly technologies deliver the best results for the money. “It definitely becomes a cost-benefit ratio looking at 10,000 genomes,” says Korf.
Joseph DeRisi (UCSF), David Haussler (UCSC), and Illumina helped launch the Assemblathon idea. The original goal was to provide two targets: one a real genome (snake), the other a synthetic genome whose true sequence was known, so that participants could determine how well they performed. In the event, the snake data weren’t ready, so Assemblathon I, which took place in early 2011, utilized just the synthetic genome.
To create the synthetic genome, the organizers took a copy of human chromosome 13, and artificially evolved the sequence using Evolver software, which introduced mutations in different regions (coding/non-coding) and at different rates. “The sequences were human-ish, but after 200 million years of evolution, didn’t look that human,” says Korf.
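The idea of evolving a sequence at different rates in different regions can be illustrated with a toy simulation. The Python sketch below is not Evolver and makes no claim about how Evolver works; it applies only random base substitutions, and every name in it (`evolve`, `divergence`, the rates and region coordinates) is a hypothetical choice made for illustration.

```python
import random

BASES = "ACGT"


def evolve(sequence, coding_regions, coding_rate, noncoding_rate, rng):
    """Return a copy of `sequence` with random substitutions.

    Positions inside any (start, end) interval in `coding_regions` mutate
    with probability `coding_rate`; all other positions use `noncoding_rate`.
    Only substitutions are modeled (no insertions, deletions or rearrangements).
    """
    coding = set()
    for start, end in coding_regions:
        coding.update(range(start, end))

    out = []
    for i, base in enumerate(sequence):
        rate = coding_rate if i in coding else noncoding_rate
        if rng.random() < rate:
            # Replace the base with one of the three other bases.
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def divergence(a, b, start, end):
    """Fraction of positions in [start, end) at which a and b differ."""
    return sum(x != y for x, y in zip(a[start:end], b[start:end])) / (end - start)


# Hypothetical example: a 20 kb random "ancestor" whose first half is treated
# as coding (slowly evolving) and whose second half is non-coding.
rng = random.Random(13)
ancestor = "".join(rng.choice(BASES) for _ in range(20000))
descendant = evolve(ancestor, [(0, 10000)], 0.01, 0.20, rng)
```

Comparing `divergence(ancestor, descendant, 0, 10000)` with `divergence(ancestor, descendant, 10000, 20000)` shows the slowly evolving half staying much closer to the ancestor. Because the ancestor is known, the same kind of comparison can be used to score an assembly against the truth.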
The Assemblathon participants—17 groups in all—were then challenged to put the synthetic reads together. “Because we knew the answer, we could evaluate each one of the assemblers,” says Korf.
Results Are In
Commenting on the results, Korf said: “A lot did a pretty good job, but it’s more difficult to assemble regions with more mutations, so the coding regions were assembled better than non-coding regions.” (The contest did not test the growing number of commercial assembly packages, from the likes of CLC bio, DNAStar, Gene Codes and others.)
The assemblies were ranked by various criteria, including contig and scaffold paths, structural and copy number errors, and so on. In the final rankings, the top five were:
• Broad Institute (ALLPATHS-LG)
• BGI (SOAPdenovo)
• Wellcome Trust Sanger Institute (SGA)
• DOE Joint Genome Institute (Meraculous)
• Cold Spring Harbor Lab (Quake, Celera, Bambus2)
Several useful tools emerged, says Korf, but experience in using the tools makes a big difference. “We found that sometimes two groups will use the same assembler, but the group that knows a bit more about the assembler might do a slightly better job. It’s something of an art at this point,” said Korf.
Korf says that wisely choosing the many different parameters involved in de novo genome assembly is difficult and “probably shouldn’t be attempted by amateurs.” He advises inexperienced users to “contact one of the major sequencing centers and get them to help you. Doing it on your own is pretty much guaranteed to give you a sub-optimal assembly… Don’t jump into genome assembly thinking it’s just like any other bioinformatics problem you can hack with some Perl scripts.”
It starts as far upstream as DNA library preparation. “You don’t want to choose the assembler as the last thing you do,” says Korf. “It must be in conjunction with the sequencing technology, how are the libraries made, the full equation. You can’t do it stepwise.”
Library preparation is a non-trivial step. “It’s really garbage in, garbage out,” says Korf. “So much is dependent on having high quality sequence and making your libraries correctly.” Indeed, some Assemblathon participants believe there should be a library construction competition, because that’s more important in some ways.
Another wise move, says Korf, is to perform a pilot project “to explore what your genome looks like” and gauge the overall repeat content of the genome in question. For example, in a recent project to sequence the gigantic pine genome (22 billion bases, roughly seven times the size of the 3-billion-base human genome), his group was surprised by the extent to which the DNA repeats had diverged. The most common repeat made up only 3% of the genome, making it easier than expected to assemble. “You should do a little homework ahead of time to get an idea of GC content and other factors,” says Korf.
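As a rough illustration of that kind of homework, the Python sketch below computes GC content and the share of all k-mers taken up by the single most frequent k-mer in a set of pilot reads. It is a minimal sketch, not the method Korf's group used: the function names are hypothetical, and the top k-mer share is only a crude proxy for repeat content.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def top_kmer_fraction(reads, k):
    """Return the most frequent k-mer and its share of all k-mers in `reads`."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    kmer, n = counts.most_common(1)[0]
    return kmer, n / total


# Hypothetical pilot reads; real input would come from a sequencing run.
pilot_reads = ["ACGTACGTAC", "TTTTTTTTTT", "GGCCGGCCAA"]
overall_gc = gc_content("".join(pilot_reads))
kmer, share = top_kmer_fraction(pilot_reads, 4)
```

A high top k-mer share in a small sample hints that a few repeat families dominate the genome, while a low share, as in the pine example, suggests the repeats are diverged enough to be told apart.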
The availability of longer read lengths, such as those produced by the Pacific Biosciences platform, should complement existing short read systems and prove a boost for genome assemblies. “The long reads are fantastic, but the error rate is a bit of an issue,” said Mario Caccamo, head of bioinformatics at The Genome Analysis Centre in the UK and a co-author of the Assemblathon I report.
But Korf says the PacBio reads can prove very useful in integrating with short read data: “Genome assembly with longer reads will get much, much easier. The game will be completely changed with reads on the order of 10 kilobases.”
As an analogy, he offered two paragraphs of text, one representing the shorter Illumina reads and the other the longer PacBio reads (see “Reads and Errors”), pointing out that despite the errors in the PacBio segment, it was still possible to interpret the text.
Korf says UC Davis was one of the first centers to receive a commercial PacBio machine, although it is currently being used more for studies of genome biology (centromere structure, fragile X repeats) than for assembly. “There are certain things that PacBio can do that nobody else can do. We like those things! Yes, there are errors, but better to ask the question with a few errors and have to do a little harder analysis than not be able to do it at all.”
Korf believes the NGS community—“super smart people, full of competitive spirit”—will figure out how to use these 3rd-generation technologies. “Right now, they haven’t had enough time to figure out how to put it all together, but they will pretty soon,” he says. “What you’ll get three years from now will be a lot better than today.”
For the second round of the Assemblathon, which concluded at the end of 2011, participants belatedly tackled the snake genome, along with a bird and a fish genome. “The disadvantage is that we don’t know the answer in the end,” said Korf. Those results will be announced later in 2012.