Taking Next-Generation Sequencing Data to the Cloud

By Kevin Davies

September 15, 2009
“There’s a lot of interest in the cloud. In a sense, GenomeQuest has built the first commercial application-specific cloud for biocomputing.” So said Ron Ranauro, GenomeQuest president and CEO, marking the July launch of GenomeQuest 6.0Beta, a sequence data management solution that provides a web-accessible, cloud computing environment for researchers to “align and mine” next-generation sequencing (NGS) data.

From the Internet, users can access the GenomeQuest cloud, namely “a 500-CPU compute farm for processing that’s purpose-built for processing volumes of sequence data.”

GenomeQuest is best known for mining sequence data for intellectual property. The strategy netted more than 100 customers, including most of the big pharmas, and several agricultural science customers. The strategy, Ranauro says, was always to create an enterprise sequence data management platform. But when would the market be ready? The arrival of the NGS era sparked what Ranauro calls the “catalytic event for causing the enterprise and academic markets to rethink the way they’re managing sequence data.”

Easy Button 
GenomeQuest 6.0Beta addresses the needs of three key constituencies—the researcher, the bioinformatician, and the IT manager. “The early visionary market for next-gen sequencing wants to do everything, but the mainstream market wants ‘the easy button,’” says Ranauro. “They also want some flexibility to tune parameters. They’re not interested in managing data but want common workflows.” GenomeQuest delivers production workflows for gene expression and variant (SNP) discovery. “Any researcher can self register and upload a file, or use the sample file and start getting results very quickly.” (See, “You want results...”).

Bioinformaticians, on the other hand, typically access data models and algorithms through an application programming interface (API). “We’ve put a tremendous investment in exposing the [API] at multiple levels. Since it’s a web application, there’s a URL API used to script and access any data or workflow or database in the system. There’s a scripted command line API which most bioinformatics developers will prefer.”

As for the IT manager, scalability is critical. “The volumes of these next-gen machines just continue to escalate,” says Ranauro. “A system that won’t scale is going to be a difficult investment to justify.”

Ranauro half-jokingly says GenomeQuest is becoming a web company. “Researchers can come to the site, self register, use a sample data set or upload their own, run workflows and mine the results.” The available sample data include metagenomic pathogen data (454), variant detection workflows, and gene expression data (Illumina, Life Technologies).

GenomeQuest 6.0 fits into the next-gen workflow from the generation of the raw data, picking up the raw FASTA files, post image processing, and uploading that file. A multigigabyte file might take several hours. For bigger files, the sneakernet will suffice. (Ranauro says GenomeQuest is open to leveraging data-transfer services from companies such as Aspera.)
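
To put the upload times in perspective, here is a back-of-the-envelope sketch of how long a file takes to move over a network link. The file sizes and the 5 Mbps upload speed are illustrative assumptions, not figures from GenomeQuest, and the `efficiency` parameter is a hypothetical allowance for protocol overhead.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to send size_gb gigabytes over a link.

    size_gb    -- file size in decimal gigabytes (1 GB = 1e9 bytes)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of the nominal speed actually achieved
    """
    if size_gb < 0 or link_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    total_bits = size_gb * 8e9                      # bytes to bits
    bits_per_second = link_mbps * 1e6 * efficiency  # effective throughput
    return total_bits / bits_per_second / 3600      # seconds to hours


if __name__ == "__main__":
    # Hypothetical scenarios: a 5 GB run and a 1 TB batch on a 5 Mbps uplink.
    for size_gb in (5, 1000):
        hours = transfer_hours(size_gb, link_mbps=5)
        print(f"{size_gb} GB at 5 Mbps: about {hours:.1f} hours")
```

Under these assumed numbers, a few gigabytes fit within an afternoon, while a terabyte would occupy the link for weeks, which is the point at which shipping a disk becomes the faster option.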

From a simple web application, the end user selects a reference genome and level of annotation. Are novel variants found in dbSNP? Are they found in coding regions? The result file is a sequence database of the assembly, which can be mined according to those properties. “Being able to mine and filter the results is the secret sauce of the scalable engine,” says Ranauro. “Now the biologists can do this work without needing to be a programmer, through a very simple web application. That’s the contribution we’re making—allowing a broader, mainstream audience to participate fully in next-generation sequencing.”
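
The kind of mining described here can be pictured as a set of filters applied to annotated variant calls. The sketch below is only an illustration of that idea, not GenomeQuest's implementation or API: the `Variant` record, its field names, and the quality threshold of 30 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Variant:
    """One annotated variant call (all field names are hypothetical)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float            # caller-assigned quality score
    dbsnp_id: Optional[str]   # None when the variant is not found in dbSNP
    in_coding_region: bool    # True when the position lies in coding sequence


def filter_variants(
    variants: Iterable[Variant],
    chrom: str,
    min_quality: float = 30.0,  # assumed threshold, for illustration only
) -> List[Variant]:
    """Keep high-quality, novel (not in dbSNP) coding variants on one chromosome."""
    return [
        v
        for v in variants
        if v.chrom == chrom
        and v.quality >= min_quality
        and v.dbsnp_id is None
        and v.in_coding_region
    ]


if __name__ == "__main__":
    # Made-up example calls; positions and IDs carry no biological meaning.
    calls = [
        Variant("chr7", 1000, "A", "G", 52.0, None, True),
        Variant("chr7", 2000, "C", "T", 48.0, "rs0000001", True),
        Variant("chr7", 3000, "G", "A", 12.0, None, True),
        Variant("chr1", 4000, "T", "C", 60.0, None, True),
    ]
    for v in filter_variants(calls, "chr7"):
        print(v.chrom, v.pos, v.ref, v.alt)
```

Each condition in the list comprehension corresponds to one question a biologist might ask of the result set, so adding or removing a criterion changes the view without rerunning the alignment.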

Biologists can select and create custom views of the appropriate reference sequence or subsets thereof. “It’s providing data management, but data isn’t really moving around or up and down from the server to the PC. All the manipulation is happening in the cloud but the user is able to manipulate [it].” The web architecture enables everything to be shared, including workflow, result databases, and selected views on results. “Those can be used as hypothesis drivers for the next set of experiments,” he says.

Upside and Roadmaps
While Ranauro has his sights set on mainstream users, he sees upside elsewhere. “In the fullness of time, a genome center is going to want to get onto the cloud, because they have to lower their costs, just the same as anybody else, to get to the $1000 genome. It might be that GenomeQuest’s platform provides a smoother path onto the cloud than taking all the in-house infrastructure and trying to recreate it on Amazon… We see ourselves providing the on-ramp to the cloud.”

Ranauro adds: “We’re actively looking at scaling options that might include Amazon. Hosting this on Amazon is a very real possibility, but it’s not currently on our roadmap.”

De novo assembly functionality is on the roadmap, however, for the second half of 2009. “We’ll provide the computational and alignment engine but we’ll rely on the industry for the assembly. There are important assemblers, such as 454 Newbler, today. For short reads, later this year—there we’ll rely probably on Velvet or Abyss.”

Ranauro also sees a rich environment for next-gen software companies such as CLC bio and DNASTAR to add value. “Those tools have a very rich feature set. There’s always going to be a researcher that can benefit.”

Ranauro says his product is alone in that it can process the data and then mine it using an easy-to-use web-based platform. “There’s a reason why the IT industry went from client-server to web-based. It provides centralized management, local control, more of a tractable knowledge engineering environment for an enterprise. We don’t see our customers wanting to move data up and down between PCs and servers or across networks. They really want to have it stored centrally but be able to manipulate it easily. We’re really the only company offering that.” 

You want results or not?!

After a stint with Eric Lander’s group at the Whitehead Institute in the mid ’90s, Richard Resnick tried his hand at industry, starting Mosaik (eventually acquired by NetGenics and LION), and had a spell in the entertainment industry before joining GenomeQuest.

“GenomeQuest has been an IP software company, but when I was consulting for the founder in ’02-’03, they had something I’d have died to have used as a bioinformatics director in ’96: this incredibly powerful platform to do massive sequence data management and comparison.”

NGS creates a huge data management and analysis problem, but GenomeQuest addresses the next step: how do I make my new transcriptome the reference sequence? Or take 100 Illumina runs and identify variation across them? “Nobody knows how to do that. It’s such a nascent market.”

Resnick feels GenomeQuest could prove useful to the genome centers. “They’re doing a lot of janitorial work they don’t need to be doing,” he says, “but they’re not our immediate commercial customers.” Instead, he is targeting—successfully—core labs with 3, 5, 10 machines running regularly, supporting university groups or medical schools.

Resnick sympathizes with the researchers presented with a stack of FASTQ reads. Without a bioinformatics friend or group, they’re lost. “The core lab managers are struck with this problem—managing the sequence data and how do I give these results to our users? They’re not the Broad or Wash U—they don’t have that level of resources—but that makes them really hot customers for us.”

Resnick just signed the University of Florida, the first 454 customer, which he is installing as an “on the premises” platform behind their firewall. Another strong source of interest is agricultural biotech. Interest from pharma and biotech is picking up slowly, with Resnick highlighting areas such as toxicology, oncology, metagenomics, and infectious diseases.

GenomeQuest’s four workflows cover variant detection, RNA-Seq, rapid mapping/annotation, and assembly. There is demand for ChIP-Seq and microRNA characterization, which will likely be added. The reference databases are continually updated. “Having the reference databases available, easily minable, and filterable is sine qua non for having accurate results,” Resnick insists.

He selects human chromosome 7, and within seconds, he is filtering the list of high-quality SNPs found from a sequencing run that affect protein translation and are totally novel. The answer is 19, and the list—and the filtering workflow—are saved for future use. Some clients still use sneakernets to upload their data. “Trucks and planes are sometimes faster than the Internet,” says Resnick. “Our transport speeds are increasing to line speed, but we’ll ship you a disk if you want one.”

GenomeQuest supports all the major platforms, including Helicos. Interested users can create a basic account and do sequence comparisons “in a massive way… in perpetuity,” says Resnick. “You can take anything you’ve uploaded, and compare it to any database you want.” The only part for which GenomeQuest charges is the high-end NGS workflows. “It’s less than the cost of the reagents,” says Resnick. He asks rhetorically: “You’ve invested thousands of dollars in a run. Do you want to get the results or not?” K.D.

This article also appeared in the September-October 2009 issue of Bio-IT World Magazine.