September 28, 2010 | David Smith, a researcher at the Mayo Clinic in Rochester, Minn., has no idea what to do with all these sequencing data. “I think I represent the vast majority of people who’ll be using these technologies,” he says wryly. “They’re not going to have Jill Mesirov and her team at the Broad standing behind them, or 500 college kids like BGI has just for bioinformatics. A lot of people are going to be very much on their own.”
The tectonic shifts brought about by next-generation sequencing—including the ability to sequence a human genome in 2010 for some $10,000 in about a week—have produced snags and solutions in almost equal measure.
Data “deluges” and “tsunamis” overwhelm scientists. Besides dealing with the idiosyncrasies of the major second-generation sequencing platforms—Roche/454, Illumina, and ABI’s SOLiD—they’ve faced a bases-versus-bytes battle: capacity-wise, these sequencers have left their computers in the dust.
In response, companies eager to help scientists access and analyze their data have blossomed in recent years. These vendors peddle programs to interface between platforms, assist with genome assembly, align reads to reference genomes, and increasingly act as intermediaries between researchers and the cloud.
There’s a rough taxonomy of software solutions for analyzing NGS data: desktop software, web-based interfaces, and technology enablers.
CLC bio, a major desktop software vendor, offers turnkey, “plug and play,” and comprehensive cross-platform analysis packages. Small research labs can purchase a CLC Genomics Machine, with 8 terabytes (TB) of standard storage; CLC Genomics Factory offers storage capacities of up to 90 TB. Like most commercial options, both products are platform-agnostic, and customers can develop their own plug-ins through the software’s API.
“Our focus is on high-throughput sequencing,” says Lasse Görlitz, head of global marketing and PR. “We like to think of ourselves as the only commercial shop which gets customers through a sequencing project of analyzing huge [200 gigabases or more] datasets. It’s an intelligent approach to accessing your data; instead of juggling datasets from hard disk to hard disk, everything’s centralized, which removes a lot of overhead from your network.”
CLC bio’s clients—BGI is one of the company’s flagship customers—are mainly biotechs, academic institutions, and agricultural researchers. “We develop our own algorithms, unlike many other vendors who’ve mostly done adaptations or used open-source alternatives,” Görlitz explains. “We take scientific papers from major journals and optimize [their] approaches.”
For instance, the first African genome sequence (published by Illumina in 2008) required 80 hours and 21 computers to assemble using Ibis; CLC bio’s algorithm completed the task on a single computer in seven hours.
Any company touting “one-stop shopping” for NGS data analysis should offer a good laboratory information management system (LIMS), solutions for formulating pipelines for alignment, plus a statistical analysis system for sample comparison. “All three pieces need to be integrated across the stack,” says Todd Smith, CEO of Seattle-based Geospiza, which offers web interface software. It’s an expensive proposition and harder to find than researchers might expect.
Smith thinks NGS applications should be geared toward functional genomics. “A lot of the dialog in NGS gets stuck at the middle layer: how do I best assemble and align sequences to a [reference] genome? But that’s only a small piece of next-gen sequence analysis: 90 percent of the time, companies focus on data analysis in just that alignment space, but you need to integrate that with pathway and ontology information. The scientist needs to be able to explore that data and tell a story.”
Geospiza’s GeneSifter allows such comparative analyses. In a transcriptome comparison of tumor and normal controls, GeneSifter can provide a principal component analysis to show how tumors are separating from normal cells, clusters of subpopulations within the patient groups, the gene list linked to pathways, and candidates for follow-up analysis.
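For readers who want a feel for what a principal component analysis of tumor and normal expression profiles involves, here is a minimal Python sketch. It is not GeneSifter's method: the expression matrix is synthetic, the `pca_scores` helper is a hypothetical name, and the assumption that ten genes are up-regulated in tumors is invented purely for illustration. It uses NumPy's singular value decomposition on mean-centered data, which is one standard way to compute principal components.

```python
import numpy as np

def pca_scores(expression, n_components=2):
    """Project samples (rows) onto the top principal components.

    `expression` is a samples-by-genes matrix. Hypothetical helper,
    not part of any vendor's software.
    """
    # Center each gene (column) so components capture variation, not means.
    centered = expression - expression.mean(axis=0)
    # SVD of the centered matrix; rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Fraction of total variance carried by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic data (an assumption): 4 normal and 4 tumor samples, 50 genes.
rng = np.random.default_rng(0)
n_genes = 50
normal = rng.normal(0.0, 1.0, size=(4, n_genes))
tumor = rng.normal(0.0, 1.0, size=(4, n_genes))
tumor[:, :10] += 5.0  # invented signal: first 10 genes higher in tumors

expression = np.vstack([normal, tumor])
labels = ["normal"] * 4 + ["tumor"] * 4

scores, explained = pca_scores(expression)
for label, (pc1, pc2) in zip(labels, scores):
    print(f"{label:6s} PC1={pc1:7.2f} PC2={pc2:7.2f}")
```

With a signal this strong, the tumor and normal samples should fall on opposite sides of the first component. Real transcriptome data would need normalization and far more care before such a plot tells a story.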
As NGS has evolved, so has Geospiza’s client base. Early on, says Smith, clients were a mix of pharma, academia, and government institutions. “With NGS, [our clients] have changed to government and academia, but the technology simply hasn’t translated to pharma yet. I think it will, eventually; pharma’s just being cautious. We’ve worked with biotechs, though—you could say these are the research labs of industry.”
Asked about sequence data management (SDM), Smith comments that “it’s not the problem to solve, but you can’t do much without it. The best analogy is digital photo software: Aperture and Adobe Lightroom took different approaches to organizing data, with different trade-offs. It’s much more about individual preferences.” Naturally, Smith suggests his own SDM solution is “the only way to solve the problem.”
In fact, GenomeQuest, another company marketing web interface software, describes itself as “the leader in SDM,” with an “open and configurable” platform that uses RDBMSs like Oracle and MySQL, while the company itself handles all sequence data and reference annotations. At the core is the GQ-Engine, with intuitive and interactive browser-based data analysis tools that, as expected, promise accelerated workflows and discovery across multiple whole-genomes. The company also claims a couple of firsts: utilizing expertise around BLAST-type searches, they’ve successfully integrated the “largest collection of reference datasets,” Google-web-indexing-style. And at this year’s Bio-IT World Expo in April, CEO Ron Ranauro unveiled GQ-PMR, the “first genomic reference system for personalized medicine-based research,” whereby pharma companies can integrate raw data from public studies like the 1000 Genomes Project directly into their private research.
DNAnexus has a completely cloud-based pipeline, eliminating the need for dedicated computer hardware and software (see, “Cloud Convergence,” Bio•IT World, May 2010). Researchers send data online to DNAnexus, and it resides in the Amazon cloud while the company carries out quality control, read mapping, and the like. Data storage is pay-as-you-go; clients aren’t subjected to monthly fees or annual licenses. Notably, both Geospiza and GenomeQuest have also headed skyward, with GenomeQuest using its own cloud server.
Cycle Computing, meanwhile, sells no genomic analysis software, but offers HPC clusters utilizing the cloud. Basically, Cycle Computing is a customer of Amazon, and it enables customers to better utilize their own next-gen technologies. “In the middle is kind of where we’re at,” says Jason Stowe, founder and CEO. “We’ve been provisioning clusters inside AWS and enabling these to be run on-demand. Our goal is to remove the costs—administrative, overhead, equipment—associated with owning your own cluster; we provide these as a service on a cloud infrastructure.”
The company’s product, CycleCloud, was built first and foremost from a data security perspective (see, “Cycle Computing’s Tour de Cloud,” Bio•IT World, Nov 2009). Condor grid technology is an important feature of CycleCloud, says Stowe, because scheduling is a key issue. “Essentially, if you’re [a large] university with 32,000 cores across your main campus plus a few satellite campuses, Condor helps run jobs on the dedicated resources so you’re using them all the time; it also keeps track of idle cores and harvests that processing power.”
Mayo Clinic’s David Smith uses NGS technologies to better understand molecular alterations in head and neck cancers. He also chairs a technology assessment group responsible for developing Mayo’s overall NGS infrastructure.
Smith has used GeneSifter since 2008. “There’s no question that Geospiza’s package has since matured dramatically, but there are many more tweaks that could be done,” he remarks. “It’s really important, for instance, that you can quickly subtract what’s in your normal sample from your tumor sample. Otherwise it’s a nuisance to look through thousands of variants without a clue whether they’re SNPs or just low-copy.”
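The tumor-minus-normal subtraction Smith describes can be pictured as a simple set difference over variant calls. The Python sketch below is only an illustration under stated assumptions: the `Variant` record, the `subtract_normal` function, and the coordinates are all hypothetical, and it does not reflect how GeneSifter or any other package implements the step. Production pipelines would also weigh coverage and call quality before treating a variant as tumor-specific.

```python
from typing import List, NamedTuple

class Variant(NamedTuple):
    """A variant call keyed by position and alleles (hypothetical record)."""
    chrom: str
    pos: int
    ref: str
    alt: str

def subtract_normal(tumor: List[Variant], normal: List[Variant]) -> List[Variant]:
    """Return tumor variants that are absent from the matched normal sample."""
    normal_set = set(normal)  # set membership makes each lookup constant time
    return [v for v in tumor if v not in normal_set]

# Made-up calls for illustration only; these are not real findings.
tumor_calls = [
    Variant("chr1", 12345, "A", "G"),  # also present in normal
    Variant("chr3", 500, "C", "T"),
    Variant("chr17", 7000, "G", "C"),
]
normal_calls = [
    Variant("chr1", 12345, "A", "G"),
    Variant("chr2", 999, "T", "A"),
]

tumor_only = subtract_normal(tumor_calls, normal_calls)
for v in tumor_only:
    print(v.chrom, v.pos, f"{v.ref}>{v.alt}")
```

Matching on chromosome, position, and both alleles is a deliberate simplification: it drops variants shared with the normal sample while keeping the order of the tumor list.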
Smith frankly admits that the number of available analysis solutions is overwhelming, requiring re-evaluation every few months. Still, he reckons that, at least for the foreseeable future, most users will opt for commercial software packages. “[NGS] equipment is cheap, but not the infrastructure around it, like storage and analysis,” he says. “People are going to discover that they hadn’t budgeted for all these hidden costs, and it’ll be sensible to opt for ready-made packages.” For each user, the key is identifying the features that matter most, then finding software to match their needs. “It boils down to users answering the question: ‘What am I trying to get out of NGS?’”
Cornell’s Qi Sun, co-director of the core facility, prefers a grab-bag of solutions—many of them open-source (see, “Open Options”)—for his bioinformatics research. He shuttles between using the BWA aligner, software from Real Time Genomics, and Malaysia-based Novocraft’s Novoalign. For genome assembly, he says, nothing beats Roche’s free Newbler, or Velvet (for Illumina reads).
For RNA-Seq and ChIP-Seq, Sun stresses the importance of software that can handle customized annotation databases. “If you load human, mouse, or fly data files, you’ll get answers quite easily,” he remarks, “but not if you’re working on the soy bean. This might not be important from a medical research standpoint, but we’re [also] an agricultural school and we think differently.” As for variation detection, Sun currently favors SoftGenetics; their software displays data in a spreadsheet rather than separate lists, making comparisons easier.
Sun has a slightly different perspective in his other role as co-director of a busy core lab. There, he seeks more of a DIY environment when handling myriad requests for project aid, and likes that much of today’s commercial software offers good visualization and an easy-to-use interface. He’s impressed by both GenomeQuest and DNAnexus cloud services, although he’d like to see the bandwidth issues with transferring data improve. “According to GenomeQuest, they’ll install a local file server at your organization, so all you need to do is move your data to this ‘sandbox’ and they’ll take care of getting it from there to their cloud cluster,” he says. “The reason we haven’t jumped in is because we currently have our own solution [BioHPC].”
Although there’s currently no dominant software for analyzing NGS data, Sun reckons it may simply be a matter of time. “Ten years ago, microarray technology was in its infancy and there were plenty of vendors trying to come up with software solutions; nowadays, GeneSpring leads the field,” he says.
David Smith, however, points out that microarrays didn’t develop at the NGS pace. “Nobody can afford to wait,” he says. “Two years ago, you sequenced one genome and that was a Nature paper. Now it’s different; Elaine Mardis’s group had to sequence 25 pairs of breast cancers that weren’t responding to chemotherapy, and the bar keeps getting raised. If you want to publish in the top journals, you have to use the most cutting-edge [NGS] technology and analytics, as quickly as possible.”
Connecting NGS to population data is a promising area, Sun remarks. “The reason most of us moved to NGS is because we want to fill the gap between sequence and population. There’s no commercial software yet that meets this need.”
“The technology needs to move forward to a point where average investigators find it affordable and usable,” says Smith. “Right now it’s still a bit out there, and for users like myself playing with it—it’s like kids with a nuclear weapon that could easily blow up in their face.”
Also, Smith adds, developers should be designing mature analytics so scientists can parse and interpret all that data. “You eventually want to deal with something you can get your mind around,” he says.
“We’re approaching a world where we can readily look at full genome and transcriptome assemblies, and we’re going to want to make clinical decisions based on what we see. The true challenge here will be integrating data with biology, and it’ll be interesting to see who begins to tackle that.”
This article also appeared in the September-October 2010 issue of Bio-IT World Magazine.