By Allison Proffitt
March 20, 2014 | The Broad Institute has released version 3.1 of the Genome Analysis Toolkit (GATK), which is optimized for Intel Advanced Vector Extensions (Intel AVX) found in Intel Xeon-based servers. The optimizations speed up variant calling, yielding a three- to fivefold overall improvement in variant discovery and allowing a whole genome to be analyzed in one day rather than three.
The update comes on the heels of version 3.0, a significant GATK update released last week, said Eric Banks, Group Leader of Genome Sequencing and Analysis at the Broad Institute, and represents work by both the GATK team and Intel to optimize the software.
“We used to have this relatively fast, and not completely accurate tool for mutation detection for SNPs and short indels called UnifiedGenotyper. It did a good job with SNPs and a pretty bad job with insertions and deletions,” Banks tells Bio-IT World. “So we spent a long time developing a much more sophisticated tool called Haplotype Caller… it’s much more accurate, but it’s much slower. So we had a problem. Many of our users were hesitant to switch over to the new tool. Even though it was better, it was much, much slower… The Broad’s actual pipeline wouldn’t switch over because it was too slow.”
When Intel promised a 2x speed improvement out of the box with just the right hardware configuration, and much greater gains with minimal software optimization, Banks says, the team was very intrigued.
“To us that’s important because it means people can actually use our tools in a production setting. Not just with a laptop looking at one sample... but we’re talking about trying to look at mutations with 26,000 samples like a Type 2 Diabetes project we just completed,” Banks says. “When you’re on that scale, every order of magnitude counts.”
Wave of Improvements
Banks says the 3.0 release gave the GATK team the opportunity to try several new things and change some of the assumptions made in the past about how the software should work. For instance, version 3.0 allows users to call samples one at a time.
“Instead of having to take 26,000 samples of sequencing data, stick them all into memory at the same time, and try to find SNPs for all of them, you can do it one sample at a time… It’s very cheap, and it means the data doesn’t have to all be together in the same file system. It’s technically much simpler and computationally much faster. At the end there’s a joint genotyping step, which is very cheap because it’s done in the… VCF.”
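The two-stage workflow Banks describes can be sketched in a few lines of Python. This is a toy illustration only, not GATK code: the function names `call_sample` and `joint_genotype`, the dictionary layout, and the sample data are all hypothetical, and the "calling" here is a simple comparison against the reference rather than anything resembling the Haplotype Caller. The point it shows is structural: each sample is processed on its own into a small summary, and the joint step only reads those summaries.

```python
from collections import defaultdict

def call_sample(sample_name, observations):
    """Toy per-sample step: keep only sites where this sample differs from the reference.

    observations maps (chrom, pos) -> (reference allele, observed allele).
    """
    return {
        site: {"sample": sample_name, "ref": ref, "alt": alt}
        for site, (ref, alt) in observations.items()
        if alt != ref
    }

def joint_genotype(per_sample_calls):
    """Toy joint step: merge the small per-sample summaries, never the raw data."""
    merged = defaultdict(list)
    for calls in per_sample_calls:
        for site, record in calls.items():
            merged[site].append(record["sample"])
    return {site: sorted(samples) for site, samples in merged.items()}

# Hypothetical two-sample cohort; each sample could live on a different file system.
cohort = {
    "sampleA": {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "C")},
    "sampleB": {("chr1", 100): ("A", "G"), ("chr1", 300): ("T", "TA")},
}

# Stage 1: one sample at a time, with no need to hold the whole cohort in memory.
per_sample = [call_sample(name, obs) for name, obs in cohort.items()]

# Stage 2: a cheap merge over the per-sample results.
joint = joint_genotype(per_sample)
```

Because stage 1 touches only one sample at a time, its memory use does not grow with cohort size, and the per-sample runs can be spread across machines before the merge.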
These improvements have already gotten rave reviews from “super users” in the GATK community, Banks says, and he expects the 3.1 optimizations to be just as well received. The GATK team’s v3.0 optimizations coupled with the optimizations from Intel released in v3.1 already achieve a 2x-4x speed increase.
Together with the improvements released in version 3.0, GATK 3.1 can now analyze data sets consisting of 100,000 DNA samples, and Banks says the team is working on capabilities to scale to millions.
“And this is the first pass,” he says. “The initial, vanilla, plainest thing you can do.”
Banks says that further optimizations are already being implemented that will give another 2x speed increase, and more are to come. The GATK team is just ensuring that each round is integrated well before release.
Intel on the Inside
Intel first came to the Broad Institute with ideas for how to improve the GATK, and the collaboration has grown to include work to speed the GATK software and efforts to improve the whole production pipeline.
For the improvements rolled out with v3.1, Intel optimized the GATK software to work with Advanced Vector Extensions. Intel AVX is a 256-bit instruction set extension to Intel SSE found in Intel platforms ranging from notebooks to servers, and is designed for applications that are highly compute intensive.
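To give a rough sense of what a 256-bit vector width means, the arithmetic below counts how many values fit in one register. It is a back-of-the-envelope sketch, not a description of the GATK code: the article does not say which numeric precision the optimized routines use, so both single and double precision are shown, and the helper name `lanes` is made up for illustration. The 128-bit figure is the register width of the earlier SSE extensions.

```python
AVX_REGISTER_BITS = 256  # width of an AVX vector register
SSE_REGISTER_BITS = 128  # width of an SSE vector register, for comparison

def lanes(register_bits, element_bits):
    """Number of values of a given size that one vector register holds."""
    return register_bits // element_bits

# Values processed per vector instruction at each precision.
avx_float_lanes = lanes(AVX_REGISTER_BITS, 32)   # 32-bit single precision
avx_double_lanes = lanes(AVX_REGISTER_BITS, 64)  # 64-bit double precision
sse_float_lanes = lanes(SSE_REGISTER_BITS, 32)

print(f"AVX: {avx_float_lanes} floats or {avx_double_lanes} doubles per register")
print(f"SSE: {sse_float_lanes} floats per register")
```

These lane counts are only an upper bound on the per-instruction gain; real speedups depend on how much of the workload can be vectorized and on memory access patterns.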
But the collaboration has been a very broad and fruitful one, and has contributed observations and suggestions for improving the entire pipeline, Banks says. “Good things happen when you get smart people in the same room.”
For example, Intel suggested a way to rethink how duplicates are marked in the BAM file. “It was an amazingly insightful observation,” Banks says. “It was a big conceptual improvement to get that done.”
The collaboration is ongoing, and Banks says there is a list of improvements that the two groups are working together to implement and roll out as soon as possible.
What to expect immediately? Further improvements in the speed of the Haplotype Caller and general improvements in the whole GATK engine. And after that, Banks says to look for more improvements inside the production pipeline.
“We met with them last week and gave them a whole list of things,” Banks says, “and we’re letting them choose what the next project is that they’ll work on.”
GATK 3.1 is available for academic, noncommercial use through the Broad Institute’s web site at http://www.broadinstitute.org/gatk/. Commercial and for-profit users can license GATK 3.1 through Appistry at http://www.appistry.com/gatk.