
Sequencing the Earth


The future role of FPGAs in enabling next-generation discovery.

By Vincent Natoli

September 27, 2011 | Guest Commentary | In his acclaimed history of the Manhattan Project, The Making of the Atomic Bomb, Richard Rhodes recounts the story of the first visit to the US by the great Niels Bohr. Years earlier, Bohr had calculated the effort required to refine sufficient quantities of enriched uranium to produce a nuclear weapon. On visiting Los Alamos in 1943, he declared that such a feat “couldn’t be done without turning the whole country into a factory,” before adding, “You have done just that.”

The effort to bring about the personal genomics revolution, pushing full genome sequencing into clinical applications, suggests the same scale of effort. Consider the sequencing data produced for just 100,000 people. It totals 30 petabytes and would require roughly 12 full racks of state-of-the-art compute nodes running for a year to process. To sequence the population of the U.S. would require a system 3,500 times larger: 42,000 racks operating continuously.
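These figures follow from simple back-of-envelope arithmetic. The short Python sketch below merely reproduces the scaling from the estimates quoted above (30 petabytes and 12 racks per 100,000 genomes, and a 3,500x multiplier for the U.S.); it is an illustration of the arithmetic, not a new measurement.

    # Back-of-envelope sketch of the sequencing-scale arithmetic quoted above.
    # All inputs are the estimates from the text; nothing here is measured.

    PEOPLE_SAMPLED = 100_000      # cohort size quoted above
    DATA_PETABYTES = 30           # total sequence data for that cohort
    RACKS_FOR_ONE_YEAR = 12       # racks of compute nodes running for a year
    US_SCALE_FACTOR = 3_500       # multiplier quoted for the U.S. population

    data_per_person_gb = DATA_PETABYTES * 1_000_000 / PEOPLE_SAMPLED
    racks_for_us = RACKS_FOR_ONE_YEAR * US_SCALE_FACTOR

    print(f"~{data_per_person_gb:.0f} GB of sequence data per person")
    print(f"~{racks_for_us:,} racks running continuously for the U.S.")

Running the sketch gives roughly 300 GB of sequence data per person and the 42,000 racks cited above.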

Yet it’s likely that each of us will be sequenced multiple times in our lifetime. Cancer patients may have their tumors sequenced on a recurring basis as treatment progresses, just as cholesterol levels are routinely monitored. The symbiotic bacteria in our bodies would be another sequencing target, not to mention agricultural products and livestock. Over time, we can compare our genomes with our neighbors’, looking for patterns and correlations, an obscenely large problem vaguely similar to the clustering algorithms run by Netflix and Amazon on vastly smaller data sets to recommend movies or books.

The computational burden of sequencing and analysis is immense, requiring trillions of CPU core hours. Processing data on this industrial scale will require huge data centers crunching bits day and night, and is indeed reminiscent of Bohr’s comments.

At this scale, two trends will emerge. First, genome processing itself will become a commodity service, as we are already witnessing (e.g. Complete Genomics, GenomeQuest, BGI). Organizations with the skills to run huge compute clusters efficiently (e.g. Google, Amazon, Microsoft) will likely become engaged in the market. The second trend will be a standardization of algorithms for mapping. For clinical applications, it makes sense that our genomes be processed using the same set of algorithms. This might come about by government decree or by mutual agreement from an industry board. Because sequencing machines differ in data output, there may be an approved algorithm for each vendor that conforms to a set of standards for accuracy and sensitivity.

Move to Commoditization

The trend toward commodity data processing argues for the least expensive solutions that solve the problem to the required rigor, while the standardization of algorithms argues for the benefits of an appliance approach. Heterogeneous compute platforms (i.e. platforms complementing CPUs with other options) are a compelling option for genomic data processing because of inherent advantages in cost, power and footprint per unit of performance. Field Programmable Gate Arrays (FPGAs)—commodity chips used in a variety of networking, signal processing and consumer applications—are a particularly effective option (see “Accelerated Methods for Bioinformatics Analysis,” Bio•IT World, Jan 2010).

FPGAs provide a huge fabric of simple processing elements that can be configured into almost any algorithm of interest. By implementing algorithms in hardware as opposed to software, large efficiencies in cost, space, and power can be achieved.

The suitability of FPGAs for bioinformatic computations has been appreciated since the late 1980s. The Smith-Waterman local alignment algorithm, for example, has become something of a “Hello World” problem for FPGA-based bioinformatics. With only four bases, DNA can be represented in 2-bit data structures. This custom data size is easily accommodated by dense arrays of compact processing elements on FPGAs, unlike CPUs, which are built to move data around in 32- and 64-bit chunks. Bioinformatics processing kernels also have a natural streaming character, which maps efficiently to FPGA architectures, where temporal parallelism can be exploited through pipelining. Finally, the problem is massively parallel at multiple levels: parallelism can be exploited across the billions of independent reads and across the number of positions that must be checked in the genome. In work sponsored by the National Science Foundation, we have found FPGAs to be at least 5x less expensive up front, 20x more power efficient and 15x denser than standard CPU solutions for equivalent performance on algorithms relevant to bioinformatics.
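For readers who want to see the kernel in question, the sketch below is a minimal, purely illustrative Python reference for Smith-Waterman scoring over 2-bit-encoded bases; it is not an FPGA implementation, and the match/mismatch/gap parameters are arbitrary. The point is the recurrence itself: each cell depends only on its upper, left, and diagonal neighbors, which is why the computation can be unrolled in hardware into a pipelined array of tiny processing elements operating on 2-bit values.

    # Minimal Smith-Waterman local alignment scoring (illustrative only).
    # In hardware the same recurrence is laid out as a systolic array of
    # small processing elements; this Python version just shows the kernel.

    # 2-bit encoding: the four bases fit in two bits each (purely
    # illustrative in Python, but it is the key to FPGA density).
    ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def smith_waterman_score(query, ref, match=2, mismatch=-1, gap=-2):
        """Return the best local alignment score between query and ref."""
        q = [ENCODE[b] for b in query]
        r = [ENCODE[b] for b in ref]
        prev = [0] * (len(r) + 1)   # previous row of the scoring matrix
        best = 0
        for qi in q:
            curr = [0]              # first column is always zero
            for j, rj in enumerate(r, start=1):
                diag = prev[j - 1] + (match if qi == rj else mismatch)
                cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
                curr.append(cell)
                best = max(best, cell)
            prev = curr
        return best

    # Example: score a short read against a reference fragment.
    print(smith_waterman_score("GATTACA", "GCATGCATTACAG"))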

The explosion in genome sequencing will lead to the consolidation of data centers, commoditization of the process and standardization of algorithms, all of which argue strongly for the efficiency and performance gains that FPGAs afford. For personal genomics to flourish, improvements must be made along the whole pipeline, from collection to sequencing to storage, analysis and interpretation. The rapid improvements in next-generation sequencing (NGS) have exposed bottlenecks elsewhere in the pipeline, from computing and storage to downstream processing and interpretation. FPGAs are a superior, cost-effective solution ideally suited to the massive data crunching that personal genomics demands.

Vincent Natoli is CEO of Stone Ridge Technology. He can be reached at: vnatoli@stoneridgetechnology.com

This article also appeared in the September-October 2011 issue of Bio-IT World magazine.

2 Comments


    I can assure the commenter that the debate concerning special-purpose hardware vs. general computing hardware began many years before the invention of either perl or java. I also point out that Moore's Law applies equally to FPGAs as to general purpose CPUs. The author's point, therefore, remains valid: there are cases where applying dedicated, simplified computational resources will always be more efficient than using tools designed essentially as compromises to the broader computational landscape. Signal processors, graphics chips, etc. are still in use even though CPUs are a lot faster than they used to be. However, in the case of FPGAs (GPUs, ASICs, whatever), the real issue is whether there will be agreed-upon methods and algorithms that can be reduced to specialized tools that remain in use over a reasonably long time - CPUs will always have the advantage when the need for flexibility is paramount.

    But the author is right, commoditization of the basic genomic description of an individual will and must occur as we move forward with clinical genomics as a standard of care. There will be various methods applied to achieve this characterization, of course, and FPGAs certainly hold the promise to contribute to some of them, particularly mapping and/or assembly requirements. Others will require general-purpose or other forms of special-purpose architectures to achieve.

    And the notion of the commenter that this commoditization is somehow antithetical to good medical practice seems quaint at best. I for one don't expect or even want my physician presuming to analyze my six billion bases using his or her creative point of view that is not likely to be as well founded as the community insight embodied in standardized algorithms. Frankly, this pretty much goes for most of medical practice. There's always going to be the Dr. Houses of the world, but they are few and far between. I'll take my chances with collective wisdom that evolves over time through the c


    The author's assertion that "genome processing will become a commodity service" is misleading. If we assume he implies "genome sequencing" when he uses the term, then although he is correct about the commoditization of sequencing, he has an error of fact, as my company, GenomeQuest, is not in the sequencing business.

    If instead he means to imply "genome reassembly," that is, the mapping and variant calling that the vast majority of the industry is swept up in, then again I would agree that this is a commoditizing industry, and this is why Complete Genomics prices it alongside their sequencing service, why Illumina plans to provide variants rather than reads, and why GenomeQuest happily will perform that computation or accept somebody else's computation as a starting point. Again, the error of fact is that GenomeQuest is not in the commoditizing business of genome reassembly.

    If he means to imply "genome interpretation," that is, the translation of genomic variation to utility in research or clinical applications, then certainly he is correct to name GenomeQuest as a player in this arena; however, in this case he is frankly incorrect in his assertion that this is a commodity. Interpretation of a continuously growing body of content is by definition not a commodity. If it were, then medicine would be a commodity. Invention would be a commodity. Creativity would be a commodity.

    As to the FPGA argument, it's bioinformatics' 2nd longest religious argument (the first being PERL vs Java). I for one don't see a need to bet against Moore's Law, particularly in a commodity business like rote computation. But hey, I'm also an Atheist.

    Sincerely,
    Richard Resnick
    CEO
    GenomeQuest, Inc.


