September 27, 2011 | Guest Commentary | In his acclaimed history of the Manhattan Project, The Making of the Atomic Bomb, Richard Rhodes recounts the story of the first visit to the US by the great Niels Bohr. Years earlier, Bohr had calculated the effort required to refine sufficient quantities of enriched uranium to produce a nuclear weapon. On visiting Los Alamos in 1943, he declared that such a feat “couldn’t be done without turning the whole country into a factory,” before adding, “You have done just that.”
The effort to bring about the personal genomics revolution, pushing full genome sequencing into clinical applications, suggests the same scale of effort. Consider the sequencing data produced for just 100,000 people. It totals 30 petabytes and would require roughly 12 full racks of state-of-the-art compute nodes running for a year to process. To sequence the population of the U.S. would require a system 3,500 times larger: 42,000 racks operating continuously.
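The scaling arithmetic above can be checked in a few lines of Python. This is a back-of-the-envelope sketch that uses only the figures quoted in the paragraph (30 petabytes and 12 racks per 100,000 people, and a 3,500x scale factor). The function and constant names are illustrative, and the decimal unit conversion (1 PB = 1,000,000 GB) is an assumption.

```python
# Figures quoted in the text.
PEOPLE_BASELINE = 100_000   # people in the baseline cohort
DATA_PB_BASELINE = 30       # petabytes of sequencing data for that cohort
RACKS_BASELINE = 12         # racks running for a year to process it
SCALE_FACTOR = 3_500        # scale-up quoted for the U.S. population

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def per_person_gb(total_pb=DATA_PB_BASELINE, people=PEOPLE_BASELINE):
    """Average sequencing data per person, in gigabytes."""
    return total_pb * GB_PER_PB / people


def scaled(value, factor=SCALE_FACTOR):
    """Scale a baseline quantity linearly by the quoted factor."""
    return value * factor


if __name__ == "__main__":
    print(f"Data per person:   {per_person_gb():,.0f} GB")
    print(f"People at 3,500x:  {scaled(PEOPLE_BASELINE):,}")
    print(f"Racks at 3,500x:   {scaled(RACKS_BASELINE):,}")
    print(f"Data at 3,500x:    {scaled(DATA_PB_BASELINE):,} PB")
```

Under these assumptions the baseline works out to 300 GB per person, and the 3,500x factor corresponds to 350 million people, 42,000 racks and 105,000 petabytes. The sketch assumes strictly linear scaling, which ignores any economies or overheads that would appear at that size.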
Yet it’s likely that each of us will be sequenced multiple times in our lifetime. Cancer patients may have their tumors sequenced on a recurring basis as treatment progresses, just as cholesterol levels are routinely monitored. The symbiotic bacteria in our bodies would be another sequencing target, not to mention agricultural products and livestock. Over time, we could compare our genomes with our neighbors’ to look for patterns and correlations, an obscenely large problem vaguely similar to the clustering that Netflix and Amazon run on vastly smaller data sets to recommend movies or books.
The computational burden of sequencing and analysis is immense, requiring trillions of CPU core hours. Processing data on this industrial scale will require huge data centers crunching bits day and night, and is indeed reminiscent of Bohr’s comments.
At this scale, two trends will emerge. First, genome processing itself will become a commodity service, as we are already witnessing (e.g. Complete Genomics, GenomeQuest, BGI). Organizations with the skills to run huge compute clusters efficiently (e.g. Google, Amazon, Microsoft) will likely become engaged in the market. The second trend will be a standardization of algorithms for mapping. For clinical applications, it makes sense that our genomes be processed using the same set of algorithms. This might come about by government decree or by mutual agreement from an industry board. Because sequencing machines differ in data output, there may be an approved algorithm for each vendor that conforms to a set of standards for accuracy and sensitivity.
Move to Commoditization
The trend to commodity data processing argues for the least expensive solutions that solve the problem to the rigor required, and the standardization of algorithms argues for the benefits of an appliance approach. Heterogeneous compute platforms (i.e. platforms complementing CPUs with other options) are a compelling option for genomic data processing because of inherent advantages in cost, power and footprint per unit of performance. Field Programmable Gate Arrays (FPGAs), commodity chips used in a variety of networking, signal processing and consumer applications, are a particularly effective option (see “Accelerated Methods for Bioinformatics Analysis,” Bio•IT World, Jan 2010).
FPGAs provide a huge fabric of simple processing elements that can be configured into almost any algorithm of interest. By implementing algorithms in hardware as opposed to software, large efficiencies in cost, space, and power can be achieved.
The suitability of FPGAs for bioinformatic computations has been appreciated since the late 1980s. The Smith-Waterman local alignment algorithm, for example, has become something of a “Hello World” problem for FPGA-based bioinformatics. With just four bases, DNA can be represented by 2-bit data structures. This custom data size is easily accommodated by dense arrays of compact processing elements on FPGAs, unlike CPUs, which are built to move data around in 32- and 64-bit chunks. Bioinformatics processing kernels also have a natural streaming character, which maps efficiently to FPGA architectures where temporal parallelism through pipelining can be exploited. Finally, the problem is massively parallel at multiple levels: parallelism can be exploited across the billions of independent reads and across the many positions that must be checked in the genome. In work sponsored by the National Science Foundation, we have found FPGAs to be at least 5x less expensive up front, 20x more power efficient and 15x denser than standard CPU solutions for equivalent performance on algorithms relevant to bioinformatics.
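To make the 2-bit representation and the Smith-Waterman “Hello World” concrete, here is a minimal software sketch in Python. It is not an FPGA implementation and not the method used in the work cited above. The base-to-bits mapping, the function names and the scoring parameters (match, mismatch and a linear gap penalty) are all assumptions chosen for illustration.

```python
# Assumed 2-bit code for the four bases; any one-to-one mapping would do.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack_2bit(seq):
    """Pack a DNA string into an integer, 2 bits per base.

    The first base occupies the lowest two bits.
    """
    packed = 0
    for i, base in enumerate(seq):
        packed |= ENCODE[base] << (2 * i)
    return packed


def unpack_2bit(packed, length):
    """Recover a DNA string of the given length from its packed form."""
    return "".join(DECODE[(packed >> (2 * i)) & 0b11] for i in range(length))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Uses the standard Smith-Waterman recurrence with a linear gap
    penalty, keeping only two rows of the score matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell score never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    packed = pack_2bit("ACGT")
    print(packed, unpack_2bit(packed, 4))
    print(smith_waterman_score("TTACGTTT", "ACGT"))
```

Each cell of the score matrix depends only on its upper, left and upper-left neighbors, so all cells on the same anti-diagonal are independent of one another. That independence is what lets hardware implementations compute many cells per clock cycle, whereas this sketch visits them one at a time.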
The explosion in genome sequencing will lead to the consolidation of data centers, commoditization of the process and standardization of algorithms, all of which argue strongly for the efficiency and performance gains that FPGAs afford. For personal genomics to flourish, improvements must be made along the whole pipeline from collection to sequencing to storage, analysis and interpretation. The rapid improvements in NGS have uncovered bottlenecks in other parts of the pipeline, from computing and storage to downstream processing and interpretation. FPGAs are a superior, cost-effective solution ideally suited to the massive data crunching that personal genomics demands. •
Vincent Natoli is CEO of Stone Ridge Technology. He can be reached at: vnatoli@stoneridgetechnology.com