New Alignment Method to Speed Up De Novo Assembly of Whole Genomes

August 15, 2014

By Bio-IT World Staff

This Thursday, a team of bioinformaticians from the National Biodefense Analysis and Countermeasures Center, the University of Maryland College Park, and sequencing company Pacific Biosciences posted a paper describing their tool MHAP to the life sciences preprint server bioRxiv. MHAP, or MinHash Alignment Process, is a dramatically faster method for ordering DNA fragments sequenced on long-read technologies like the PacBio RS II Sequencer or the Oxford Nanopore MinION, making it easier to assemble whole genomes from scratch without the use of a reference genome. Bio-IT World previously covered MHAP following a presentation by senior author Adam Phillippy at the PacBio User Group Meeting this June; however, the newly released paper goes into much greater detail, including assemblies of the human genome and four important model organisms.

As the price of DNA sequencing continues to fall, the barrier to analyzing genomes is increasingly the compute power needed to make sense of the raw sequencing data. This is especially true for de novo whole genome assembly, where software tools have typically fallen back on the brute-force, computationally demanding method of comparing every DNA fragment's entire sequence against every other fragment in search of overlaps.

MHAP, created by Konstantin Berlin and Sergey Koren, instead renders each fragment as a short series of numbers generated through hash functions. Briefly, every DNA fragment fed into MHAP is split into a string of k-mers (the authors recommend 16-mers for a human genome). Each k-mer is then subjected to the same series of hash functions, which output short numbers. For each hash function, only the smallest of these numbers is stored to identify the DNA fragment. Thus, the fragment is reduced to one number for each hash function, and each of those numbers corresponds to a single k-mer. MHAP can then search for fragments with many of the same identifying numbers, indicating that they have k-mers in common; close matches are overlapped to assemble the whole genome.
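The steps described above can be illustrated with a minimal Python sketch. This is not MHAP's code: the function names, the use of seeded BLAKE2b as a stand-in for a family of hash functions, and the default of 64 hash functions are all assumptions made for illustration. Only the 16-mer default comes from the authors' recommendation for a human genome.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Hash a k-mer to a 64-bit integer.

    Assumption: each seed value stands in for one distinct hash function.
    """
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(seq, k=16, num_hashes=64):
    """Reduce a fragment to one number per hash function.

    For each hash function, only the smallest hash value over all of the
    fragment's k-mers is kept. num_hashes=64 is an illustrative choice.
    """
    fragment_kmers = kmers(seq, k)
    if not fragment_kmers:
        raise ValueError("sequence is shorter than k")
    return [
        min(hash_kmer(km, seed) for km in fragment_kmers)
        for seed in range(num_hashes)
    ]


def similarity(sketch_a, sketch_b):
    """Fraction of positions where two sketches hold the same minimum.

    This fraction estimates how many k-mers the two fragments share.
    """
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two fragments whose sketches agree in many positions are likely to share many k-mers, so they become candidates for overlap. Comparing two fixed-length lists of numbers is much cheaper than comparing the full sequences, which is where the speedup comes from.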

According to the authors, this process is so fast that an E. coli genome can be assembled de novo using MHAP in essentially the same amount of time it would take to create a reference-guided assembly of the same genome from short Illumina reads. This would put de novo assembly within reach for small labs that previously only had access to the computing resources for reference-guided assembly. MHAP sees even greater gains in efficiency with larger, eukaryotic genomes.

The authors used MHAP to assemble a human genome, plus genomes of the bacterium E. coli, the yeast S. cerevisiae, the fruit fly D. melanogaster, and the plant A. thaliana. All these assemblies will be made freely available in GenBank. While none of the assemblies has yet been thoroughly vetted, the authors suggest that their assemblies may have closed gaps in the D. melanogaster, A. thaliana and even human reference genomes, thanks to their ability to use long-read sequencers. The most significant benefit of MHAP, however, is the time and cost savings to genomics labs. Both the D. melanogaster and A. thaliana genomes were assembled on a desktop computer in a matter of days; compared to an assembly of the fruit fly genome using the previous alignment tool BLASR, MHAP used about one six-hundredth of the computing power.

Outside groups can now review the paper on bioRxiv, as well as MHAP itself, which is available on GitHub.