BGI debuts new journal, cloud services, and tools.
By Allison Proffitt
August 2, 2011 | SHENZHEN, CHINA—At the Bio-IT APAC conference hosted by BGI in Shenzhen in July*, researchers announced two new cloud-based software-as-a-service offerings for next-gen data analysis and several new open source assembly tools, and launched a new journal for next-gen data.
Structural variation requires a “totally different scale of technology,” said Yingrui Li, explaining that structural variations are more specific to individuals than SNPs are. He called for more de novo sequencing, suggesting that whole genome de novo assembly would offer a more comprehensive structural variant map. Li’s message aligned well with updates to the SOAP algorithms. The Short Oligonucleotide Analysis Package gained a de novo short-read assembler (SOAPdenovo 2), an alignment tool (SOAP3-GPU/CPU), a graph-based indel finder (SOAPindel), and an assembly-based structural variation finder (SOAPsv). These updates join the existing alignment tool (SOAPaligner/soap2) and re-sequencing consensus sequence builder (SOAPsnp). SOAP3 reflects improvements in two branches of the algorithm. GPU-accelerated alignment with BWT could take 2.6 seconds to perform exact matching for 1 million 100bp reads. SOAP3-CPU shows improved accuracy over SOAP2 with similar speed. SOAPdenovo 2 is designed to assemble human-sized genomes, and reflects algorithmic changes to contig construction, scaffolding, and gap closure. The SOAP toolkit is available at http://soap.genomics.org.cn/.
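For readers unfamiliar with BWT-based exact matching, the idea can be sketched in a few lines of Python. This is a toy illustration of backward search over a Burrows-Wheeler transform, not SOAP3's implementation: the function names are hypothetical, the suffix array is built naively, and nothing here runs on a GPU.

```python
def build_bwt_index(reference):
    """Build a toy BWT index (suffix array, first-column offsets, occurrence table)."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character for each sorted suffix is the character preceding it.
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[ch] = number of characters in text that sort strictly before ch.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, first, occ


def exact_match(read, index):
    """Return sorted 0-based positions where read occurs exactly in the reference."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(read):  # backward search: last character first
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_bwt_index("ACGTACGTTAGC")
print(exact_match("ACGT", index))
print(exact_match("TAG", index))
```

Each read costs one table lookup per base regardless of reference size, which is part of why the approach lends itself to massively parallel hardware. For scale, the quoted figure of 1 million 100bp reads in 2.6 seconds works out to roughly 385,000 reads per second.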
Hecate and Gaea (named for Greek goddesses) are two “flexible computing” solutions for de novo assembly and genome resequencing that make the most of the new SOAP algorithms. These are “cloud-based services for genetic researchers” so users don’t need to “purchase your own cloud clusters,” said Evan Xiang, part of the flexible computing group at BGI Shenzhen (see “BGI Cloud on the Horizon,” Bio•IT World, Jan 2011). Hecate will do de novo assembly, and Gaea will run the SOAP2, BWA, Samtools, Indel, and BGI’s realSFS algorithms. Xiang expects an updated version of Gaea, with more algorithms available, to be released later this year.
Flexible computing, explained Xiang, is a more efficient cluster architecture than a traditional cloud. Jobs of the same type are grouped together on the cluster to make the most of computing power and address scalability issues. For instance, CPU-intensive jobs are grouped together, as are memory-intensive jobs and input/output-intensive jobs.
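The grouping Xiang described can be pictured with a short Python sketch. It is purely illustrative: BGI did not describe its scheduler in detail, so the `Job` fields, the normalized demand values, and the "dominant resource" rule below are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job description; demands are normalized to the range 0..1.
    name: str
    cpu: float
    memory: float
    io: float


def dominant_resource(job):
    """Return the resource type the job leans on most heavily."""
    demands = {"cpu": job.cpu, "memory": job.memory, "io": job.io}
    return max(demands, key=demands.get)


def group_jobs(jobs):
    """Place each job in the pool that matches its dominant resource."""
    pools = defaultdict(list)
    for job in jobs:
        pools[dominant_resource(job)].append(job.name)
    return dict(pools)


# Made-up workload for illustration only.
jobs = [
    Job("assembly", cpu=0.4, memory=0.9, io=0.2),
    Job("alignment", cpu=0.9, memory=0.3, io=0.3),
    Job("format_conversion", cpu=0.1, memory=0.1, io=0.8),
    Job("snp_calling", cpu=0.7, memory=0.4, io=0.2),
]
print(group_jobs(jobs))
```

In this toy workload the alignment and SNP-calling jobs land in the CPU pool, assembly in the memory pool, and format conversion in the I/O pool.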
Both the Hecate and Gaea services will run on the BGI compute cluster because “Amazon is slow,” Xiang said. Running the services on an in-house cluster also alleviates any internet access issues.
Hecate is built on a series of distributed algorithms that recognize and simplify non-branching, repeat-free regions of the genome, correct errors, and resolve ambiguous bubbles and short repeats, combined with distributed graph shrinkage algorithms that construct a linear DNA sequence. Based on BGI’s SOAPdenovo and SOAP2 algorithms, Hecate is more scalable than those algorithms alone.
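The first of those steps, collapsing non-branching regions, is a standard idea in graph-based assembly and can be sketched on a single machine. The Python below is a generic de Bruijn graph illustration, not Hecate's distributed code: the function names are hypothetical, and it ignores reverse complements, error correction, bubbles, and isolated cycles.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def compact_paths(succ, pred):
    """Merge each maximal non-branching path into one linear sequence."""
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # An inner node has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in nodes:
        if is_inner(node):
            continue  # paths start only at branch points or dead ends
        for nxt in succ.get(node, ()):
            path, cur = node, nxt
            while is_inner(cur):
                path += cur[-1]  # extend by the one new base
                cur = next(iter(succ[cur]))
            path += cur[-1]  # include the node that ends the path
            contigs.append(path)
    return sorted(contigs)


# Two toy reads that share a prefix and then diverge.
succ, pred = build_graph(["ACGTA", "ACGTC"], k=3)
print(compact_paths(succ, pred))
```

The shared prefix collapses into a single sequence that stops at the point where the two reads diverge, and each branch becomes its own short sequence. A production assembler would then have to decide which branches are errors and which are real variation.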
Xiang presented results from speed comparisons showing significant cost and time savings using Hecate for de novo assembly. Running SOAPdenovo on a single server for 70 hours resulted in 80% genome coverage at a hardware price of $150,000. Using 96 Hecate cores, the genome coverage increased to 84% in 42 hours at a price of $60,000.
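The size of those savings follows directly from the quoted figures. The short Python calculation below uses only the numbers Xiang reported; the dictionary keys and the helper function are illustrative names.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


# Figures as reported in Xiang's comparison.
single_server = {"hours": 70, "coverage_pct": 80, "hardware_usd": 150_000}
hecate_96_cores = {"hours": 42, "coverage_pct": 84, "hardware_usd": 60_000}

time_saved = percent_reduction(single_server["hours"], hecate_96_cores["hours"])
cost_saved = percent_reduction(
    single_server["hardware_usd"], hecate_96_cores["hardware_usd"]
)
coverage_gain = hecate_96_cores["coverage_pct"] - single_server["coverage_pct"]

print(f"Run time reduced by {time_saved:.0f}%")
print(f"Hardware price reduced by {cost_saved:.0f}%")
print(f"Genome coverage up {coverage_gain} percentage points")
```

By these numbers, the Hecate run took 40% less time on hardware priced 60% lower, while covering four more percentage points of the genome.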
Gaea is designed to distribute resequencing computation across a cluster of nodes using the Hadoop Streaming framework, with personalized algorithm interfaces for SOAP and BWA. For the current version of Gaea (v1.2), Xiang reported speed increases of 75x for SOAP2 and 90x for BWA using 100 cores. At 400 cores, the speedups rose to 300x and 346x, respectively, compared with running either algorithm on a single core. Xiang expects Gaea v2.0 to see further improvements.
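One way to read those speedups is as parallel efficiency, meaning the speedup divided by the number of cores, where 1.0 would be perfect linear scaling. The Python below applies that standard formula to the reported figures; the variable and function names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / cores


# (algorithm, cores, reported speedup over a single core) for Gaea v1.2.
reported = [
    ("SOAP2", 100, 75),
    ("BWA", 100, 90),
    ("SOAP2", 400, 300),
    ("BWA", 400, 346),
]

efficiencies = {
    (name, cores): parallel_efficiency(speedup, cores)
    for name, cores, speedup in reported
}

for (name, cores), eff in efficiencies.items():
    print(f"{name} on {cores} cores: {eff:.1%} of linear scaling")
```

On these figures, SOAP2 holds at 75% efficiency from 100 to 400 cores, while BWA slips only slightly, from 90% to 86.5%, which suggests the framework scales close to linearly over that range.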
Gaea is also optimized for a biomarker analysis toolkit that includes SOAPsnp, DIndel and realSFS for SNP calling, indel calling, and gap alignment.
Also at the event, BGI formally announced its new journal, GigaScience, which will launch in November 2011. Co-published by BGI and BioMed Central, GigaScience is an integrated journal and database, said Scott Edmunds, editor of the journal. GigaScience plans to stress usability and reproducibility in its review process. The journal will solicit “big data” studies and hopes to provide a forum for dealing with the difficulties of handling large-scale data from all areas of the life sciences. In addition to traditional peer-reviewed papers, GigaScience will publish citable datasets, each with a permanent digital object identifier (DOI). Datasets will be hosted on the BGI cloud along with the SOAP toolkit and other BGI products. This will facilitate tool testing, Edmunds said, because the tools and data are in the same place.
Having DOIs for datasets will enable researchers to cite datasets used in their work and, Edmunds hopes, speed data release and dissemination. “Dealing with data is not just about storage,” he said, “but dissemination too.” To prime the pump, BGI released eight animal genomes, each with a DOI that enables the dataset to be freely used by researchers and then cited in publications. •
This story also appeared in the 2011 July-August issue of Bio-IT World magazine.