By Kevin Davies
September 28, 2010 | Computer scientist Toby Bloom has been the director of informatics for the genome sequencing group at the Broad Institute since 2001, bearing the brunt of the NGS data deluge.
She manages a team of two dozen people, mostly software engineers and Java programmers. The team tracks everything from the time samples enter the building to the processing of the data. That includes sample inventory, project tracking, the LIMS (Laboratory Information Management System), the analysis pipeline, alignment, quality control, and SNP (single nucleotide polymorphism) calling. A data repository/content management system makes all that sequence data available to the researchers.
The LIMS is homegrown and has been rebuilt many times, Bloom says ruefully. Because of the Broad’s scale and the speed of change, she says it’s impossible to find a single commercial product capable of meeting all her needs. The team also built a pipeline manager to specify the workflows for particular types of sequence analysis. “It lets us handle many pipelines at once,” says Bloom. “It lets us pull in information from our instruments, our LIMS, and our sample repository to decide what to do on the fly.”
Currently that pipeline is handling 0.5-1 terabases of sequence a day. That’s a lot of processing, she says. “We want to make sure that if something fails—because we’re running 2,000 compute cores at a time, things will fail, servers will drop out—it can track where everything is, what’s failed, what’s stuck and hasn’t turned up. Our goal is to be able to restart from the last step automatically without a lot of human intervention.”
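That restart-from-last-step goal can be illustrated with a minimal checkpointing sketch. All names below are hypothetical; the Broad's actual pipeline manager is internal and far more elaborate.

```python
# Minimal sketch of restart-from-last-completed-step tracking.
# Names are illustrative, not the Broad's actual pipeline manager.

def run_pipeline(steps, state):
    """Run named steps in order, skipping any already marked done in `state`.

    steps: list of (name, callable) pairs
    state: dict mapping step name -> True once the step has completed
    """
    for name, func in steps:
        if state.get(name):
            continue          # completed in a previous run; skip
        func()                # may raise if a server drops out mid-step
        state[name] = True    # checkpoint only after the step succeeds
    return state

# Usage: simulate a mid-pipeline failure, then restart from the checkpoint.
log = []
def align(): log.append("align")
def dedup(): raise RuntimeError("server dropped out")

state = {}
try:
    run_pipeline([("align", align), ("dedup", dedup)], state)
except RuntimeError:
    pass                      # align is checkpointed; dedup is not

def dedup_retry(): log.append("dedup")
run_pipeline([("align", align), ("dedup", dedup_retry)], state)
# On restart, align is skipped and the pipeline resumes at dedup.
```

The key design point is that the checkpoint is written only after a step succeeds, so a crash anywhere inside a step causes that step, and only that step, to rerun.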
Within the pipeline manager, Bloom runs Illumina’s own software. The balance of processing on and off the instruments evolves over time. “We started out doing the image processing off the instrument, but it got to the point where [Illumina] could do the image processing reliably enough on-instrument that we could use that. Then we started pulling the intensities from the instrument, instead of images.”
With the new HiSeqs, Bloom hopes the base calling can be done on the instrument—for now it’s being handled in the pipeline—so she can focus on the downstream processing: creating BAM files, recalibration, alignment, deduplication, quality analysis, SNP calling, and so on.
Bloom would like to share more of her resources with the academic community, but points out that it’s difficult to release pieces of code that depend on Broad databases or other internal metadata. “We’ve released some of our BAM file processing tools, the Picard tools, publicly,” she notes.
The Sample Deluge
Surprisingly perhaps, Bloom says the biggest impact of the NGS data deluge hasn’t been the data volume as much as the effect on the lab and the LIMS. “It was more the change in the number of libraries we were making, the number of samples in the lab at once, than the actual amount of data,” says Bloom. With hundreds of samples potentially in the same step in the lab at the same time, the LIMS now employs multiple layers of tracking.
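The core tracking problem (hundreds of samples sitting at the same lab step at once) can be sketched minimally. The structure below is hypothetical, not the Broad's LIMS schema.

```python
# Minimal sketch of per-step sample tracking; not the Broad's LIMS schema.
from collections import defaultdict

class SampleTracker:
    """Track each sample's current lab step and list a step's backlog."""
    def __init__(self):
        self.step_of = {}                   # sample id -> current step
        self.samples_at = defaultdict(set)  # step -> set of sample ids

    def move(self, sample, step):
        old = self.step_of.get(sample)
        if old is not None:
            self.samples_at[old].discard(sample)
        self.step_of[sample] = step
        self.samples_at[step].add(sample)

# Usage: 300 samples queued at one step, one moved onward.
tracker = SampleTracker()
for i in range(300):
    tracker.move(f"S{i:04d}", "library_construction")
tracker.move("S0001", "sequencing")

print(len(tracker.samples_at["library_construction"]))  # prints 299
```

Keeping both indexes (by sample and by step) is one way a LIMS can answer "where is this sample?" and "what is queued at this step?" without scanning everything, which matters once hundreds of samples share a step.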
As lab processes change quickly, Bloom’s team is often building software ahead of the process in a variation of agile software development. “We’re trying to get enough working that [the researchers] can function and get the data they need, before they really know what they need from us,” she says. “We need to build it in small pieces and then add to it without rebuilding what you already did.”
Early access to NGS instruments means that Bloom’s team often takes beta versions of the vendor software as well. “We’re expected to find their bugs. We can’t wait for their official release,” she says. There are weekly calls with Illumina, for example, about the informatics.
Bloom is waiting to see what the implications of the new third-generation platforms might be for her team. Changes to the sample prep process have a direct impact on the LIMS, but another variable is the size of the data—the bytes/base ratio. With the HiSeq, for example, Bloom says Illumina has essentially reached one byte per base.
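At one byte per base, the storage arithmetic implied by the throughput quoted earlier is straightforward. This is a back-of-envelope sketch using only the figures cited above.

```python
# Back-of-envelope: storage implied by the quoted throughput at 1 byte/base.
BYTES_PER_BASE = 1.0        # Illumina HiSeq, per Bloom's estimate
terabases_per_day = 1.0     # upper end of the 0.5-1 terabases/day quoted

terabytes_per_day = terabases_per_day * BYTES_PER_BASE
terabytes_per_year = terabytes_per_day * 365

print(terabytes_per_day)    # prints 1.0 (TB/day)
print(terabytes_per_year)   # prints 365.0 (TB/year)
```

A platform whose bytes/base ratio came in above or below one would scale these storage (and transfer) figures proportionally, which is why the ratio is a planning variable in its own right.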
Another future priority is exploring cloud computing. While smaller centers that don’t have the Broad’s IT infrastructure are looking for solutions to handle the massive data volumes and complexity of the processing, the Broad’s interest is more about sharing data from major collaborative projects such as the 1000 Genomes Project with other large labs and genome centers. “Getting the data back and forth from NCBI or other centers is a burden,” says Bloom. “Moving the compute to the data seems very much more efficient than having to keep moving the data every time you want to compute. So that’s the model: can we put the data in one place and move the compute to the data as needed?”
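The burden of moving data between centers can be made concrete with a rough transfer-time estimate. The bandwidth and dataset-size figures below are illustrative assumptions, not numbers from the article.

```python
# Rough sketch: hours needed to ship a dataset over a network link.
# Bandwidth and size figures are illustrative assumptions only.

def transfer_hours(dataset_tb, link_gbps, efficiency=0.7):
    """Hours to move `dataset_tb` terabytes over a `link_gbps` gigabit/s
    link, assuming the link sustains `efficiency` of its nominal rate."""
    bits = dataset_tb * 1e12 * 8             # terabytes -> bits
    bits_per_sec = link_gbps * 1e9 * efficiency
    return bits / bits_per_sec / 3600

# Shipping a hypothetical 100 TB slice of a large collaborative dataset
# over a sustained 1 Gb/s link:
hours = transfer_hours(100, 1.0)
print(round(hours))  # prints 317 (about two weeks)
```

Weeks per round trip is what makes "move the compute to the data" attractive: the job description shipped to a remote cluster is kilobytes, while the dataset it operates on is terabytes.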
One idea is to put the Broad’s pipeline manager on the Amazon cloud and “let people go to the cloud and figure out how to use it for their own pipelines.” Bloom admits there’s a lot of work ahead. “This only works if it’s easy to get the data in the cloud. It’s clearly not the kind of application that was targeted originally by the public cloud vendors.”
This article also appeared in the September-October 2010 issue of Bio-IT World Magazine.