Arpeggi Adds Genome in a Bottle Consortium Data to GCAT

By Matthew Dublin

July 15, 2013 | Texas-based bioinformatics startup Arpeggi (see, “Arpeggi’s Harmonious Approach to NGS Data Analysis”) has announced the addition of data from the Genome in a Bottle Consortium to its online Genome Comparison & Analytical Testing (GCAT) toolkit. GCAT is a freely available cloud-based platform for evaluating the accuracy of next-generation sequencing (NGS) analysis pipelines that provides performance reports which users can share and discuss with the community.

Released in April at the Bio-IT World Conference & Expo held in Boston, Arpeggi’s GCAT has been steadily adopted by both academic investigators and commercial vendors to compare the performance of their NGS alignment and variant calling software tools. Users download the test data, analyze it on their local resources, and then upload the results to GCAT, which scores them against various performance metrics. Benchmarks include coverage depth, the ratio of correctly to incorrectly mapped reads, and transition/transversion ratios. After analysis is complete, GCAT produces visual reports on results and performance that can then be compared to reports uploaded by other users.
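For readers unfamiliar with the last of those benchmarks, the following is a minimal sketch of how a transition/transversion (Ti/Tv) ratio can be computed from a list of single-nucleotide substitutions. It is an illustration only and is not taken from GCAT; the function names and the input format (pairs of reference and alternate bases) are assumptions made for the example.

```python
# Illustrative sketch only; not GCAT's implementation.
# A transition swaps purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
# Any other single-base substitution is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """Return True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Compute Ti/Tv from an iterable of (ref, alt) single-base pairs.

    Entries that are not simple A/C/G/T substitutions are skipped.
    Returns None when there are no transversions to divide by.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variants
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


# Toy input invented for the example: three transitions, one transversion.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
example_ratio = ti_tv_ratio(example_snvs)
```

A ratio like this is typically used as a sanity check on a call set rather than as a direct measure of accuracy, since it needs no truth data to compute.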

GCAT, which is hosted on Amazon Web Services' cloud, was originally developed to benchmark and evaluate the Arpeggi engine, the company's proprietary variant caller. In addition to the Genome in a Bottle Consortium data, GCAT contains four real exome datasets produced by Life Technologies and Illumina sequencing platforms.

Now that GCAT has been fortified with the Genome in a Bottle Consortium data, researchers have access to highly confident genotype calls that can be easily employed to score the performance of analysis pipelines.
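As a rough illustration of what scoring a pipeline against highly confident calls can involve, the sketch below compares a set of variant calls with a truth set and counts true positives, false positives, and false negatives. It is a simplified example under stated assumptions, not GCAT's or the consortium's method: variants are represented as hypothetical (chromosome, position, reference, alternate) tuples, the data is invented, and a real comparison would also restrict scoring to the regions where the truth calls are considered confident.

```python
# Illustrative sketch only; the variant representation and toy data are assumptions.
def score_calls(truth, calls):
    """Compare a call set to a truth set of (chrom, pos, ref, alt) tuples."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Toy data invented for the example.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
}
pipeline_calls = {
    ("chr1", 100, "A", "G"),
    ("chr2", 50, "G", "A"),
    ("chr2", 75, "T", "C"),
}
result = score_calls(truth_set, pipeline_calls)
```

In this toy case the pipeline recovers two of the three truth variants and adds one call that the truth set does not support, so both sensitivity and precision come out at two thirds.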

“The value of the Genome in a Bottle data is that it’s a ‘truth set,’ they have areas of this genome for which they have developed highly confident genotype calls,” said David Mittelman, an associate professor at Virginia Bioinformatics Institute and Arpeggi's chief scientific advisor. “So by integrating that data into GCAT, we’re offering a great metric you can use to evaluate your performance.”

The Genome in a Bottle Consortium is an international effort spearheaded by the National Institute of Standards and Technology (NIST) aimed at developing reference materials for human genome sequencing in order to assess the performance of NGS platforms. The consortium recently released an early set of highly confident calls based on publicly available genome data and is currently working to characterize that data in greater detail.

“I thought the GCAT toolkit would be a good platform for people to understand the performance of their sequencing image by comparing their variant calls to our highly confident genotype calls we’re developing as part of the Genome in a Bottle Consortium,” said Justin Zook, consortium leader and a biomedical engineer at NIST. “It's the first tool that I've seen that allows you to compare lots of different methods in the same way, as opposed to everyone doing their own validation of their own methods, so it's nice to have a centralized resource like this.”

Early next year, the consortium is planning to release a highly characterized HapMap sample together with a set of highly confident genotype calls. In a collaboration with Coriell Cell Repositories, the consortium was recently able to grow a large quantity of NA12878 cells from the HapMap sample set, totaling roughly 8,000 vials of 10 micrograms each. Consortium members will be able to request these samples next year.

But it’s not just GCAT that benefits from integrating the Genome in a Bottle Consortium data. Participants in the consortium are using GCAT to add another dimension to their published research as well. Authors can include links in published papers to their GCAT results in order to publicly share information on their pipelines and performance results. Zook and his Genome in a Bottle Consortium colleagues currently have a paper under review at Nature Biotechnology that links to data taken directly from GCAT. The authors used GCAT to generate figures comparing different sequencing and bioinformatics methods to the highly confident genotype calls they generated for their pilot candidate NIST Reference Material.

In one example, Novoalign, BWA, BWA-MEM, and Bowtie2 are compared in terms of the number of false negative and false positive results they return. The team found that Novoalign, which uses the Needleman-Wunsch algorithm to find optimum alignments, has the lowest number of false positives while BWA-MEM, a version of the Burrows-Wheeler Aligner, is better at eliminating false negatives. Bowtie2 seemed to lag behind the other three tools, reporting a high number of both false negative and false positive results.

In the 2013 results, BWA-MEM and Novoalign3, for example, show continued improvement. In the accompanying plot, better-performing tools fall farther to the right and toward the bottom.

Another example looked at the results provided by two versions of the Broad's Genome Analysis Toolkit (GATK) in combination with Novoalign against Illumina's iSAAC genome alignment tool. According to performance reports from GCAT, GATK with Novoalign demonstrated a much lower rate of both false negatives and false positives compared to Illumina's tool.

Zook and his colleagues have already observed significant disparities among tools operating on the consortium’s datasets.

“So far, running our datasets on GCAT has shown us that some tools are being very aggressive but they lose a lot in terms of false positives, while other tools are very conservative but you don't get to see how much stuff is missing,” said Zook. “So the ability to see those differences in performance is a real key feature that is critical and missing in our community discussion of tools.”

Future plans for GCAT include the integration of other well-characterized genome data sequenced at exceptional depth, such as Illumina’s Platinum Genomes datasets, as well as more comparison functions, which Arpeggi hopes will eventually help address the lack of continuity among variant callers and some mappers.

“GCAT is also a template for new ways to do science,” says Mittelman. “Not just the openness and transparency, but also the collaborations between academic labs, government institutes (like NIST) and for-profit companies. Especially in today's economy, we need to be more creative in how we approach scientific problems (and how we fund them). I like to think that with GCAT we were also experimenting with how to experiment.”