Odds and Ends from the 2016 Bio-IT World Conference and Expo
By Aaron Krol
April 15, 2016 | The 15th annual Bio-IT World Conference & Expo met last week in Boston, and as always, the event sparked some lively conversations about the direction of genomics, personalized medicine, and large-scale scientific computing. This year, those discussions included the coining of “cloud sobriety” by Chris Dagdigian of BioTeam, who predicted in his staple “Trends from the Trenches” talk that some major life science institutions will soon pull back from their cloud deployments as they start to realize their ongoing storage costs.
A debate was also reignited on the relative benefits of whole genome versus whole exome sequencing for clinical diagnostics, when keynote speaker Howard Jacob came down firmly on the side of genomes. (If you’re prepared to get sucked down a rabbit hole, check out the Storify put together by Keith Robison of the Omics! Omics! blog, plus the earlier Storify we did on the same question in December.) We eagerly await a good study on the costs, benefits, and diagnostic yields of both techniques; this is clearly information the community needs.
Elsewhere at Bio-IT World, we’ve covered keynote speeches delivered by Heidi Rehm, on sharing and curating clinical genetics data; by Jacob, who advocated for better access to genome-wide testing for patients with rare disease; and by a panel including Yaron Turpaz, Catherine Brownstein, and Bill Evans, who discussed ways to integrate big clinical and omics data into the practice of healthcare. We also named the winners of the 2016 Bio-IT World Best Practices Awards, for forward-thinking IT projects, and Best of Show Awards for outstanding new products, while Bioinformatics.org named Benjamin Langmead the latest winner of the Benjamin Franklin Award for contributions to open science.
But even outside these headline events, speakers at the conference showcased plenty more exciting ideas and announcements. Here are a few of the coolest talks we caught at this year’s expo:
Sheila Reynolds, a Senior Research Scientist at the Institute for Systems Biology (ISB) in Seattle, discussed her team’s progress on building a Cancer Genomics Cloud. ISB was one of three organizations that the National Cancer Institute chose in 2014 to create these portals for easy access to data from The Cancer Genome Atlas (TCGA). The first Cancer Genomics Cloud, developed by Seven Bridges, came online just this February, and won a Best of Show Award at the expo last week.
ISB has been contributing to TCGA, which now contains over a petabyte of tumor sequencing data, since 2009, building visualization tools and helping to characterize mutations from different tumor types. But as Reynolds said, the sheer size of TCGA now severely limits the number of institutes that can usefully work with the data. Through the Cancer Genomics Clouds, TCGA data will now be made available in publicly accessible cloud environments, alongside toolkits for custom analyses, so even smaller research groups can dig through the entire Atlas.
ISB’s Cancer Genomics Cloud, now in a “community evaluation” phase, is hosted on the Google Cloud Platform, and will eventually take advantage of the Google Genomics engine optimized for working with DNA reads and variant files. ISB has also curated all the open access data in TCGA into huge query tables, so users can make rapid searches by any clinical parameter. “You can host queries on this data, and that data will get sent to tens of thousands of machines, and you can scan hundreds of gigabytes of data in 30 seconds,” Reynolds said, adding that the same capabilities will be available for any user-contributed data brought into the Cancer Genomics Cloud. “Whether you want to share that data with the public, as we have, or keep it private or just share with your collaborators… you can do SQL joins and queries that combine your data with this wealth of TCGA data.” As with the solution from Seven Bridges, the ISB aims to eliminate any downloading or data preparation steps that would prevent people from accessing the entire TCGA dataset for custom research projects.
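The kind of query Reynolds describes, a SQL join that combines a researcher's own data with shared clinical tables, can be sketched with a toy example. The snippet below uses Python's built-in sqlite3 module as a local stand-in for the hosted query tables. The table names, column names, and values are all hypothetical; they are not drawn from TCGA or from ISB's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tcga_clinical (   -- hypothetical stand-in for a shared clinical table
            sample_id TEXT PRIMARY KEY,
            tumor_type TEXT,
            age_at_diagnosis INTEGER
        );
        CREATE TABLE my_expression (   -- hypothetical user-contributed measurements
            sample_id TEXT,
            gene TEXT,
            expression REAL
        );
    """)
    conn.executemany(
        "INSERT INTO tcga_clinical VALUES (?, ?, ?)",
        [("S1", "BRCA", 52), ("S2", "BRCA", 61), ("S3", "LUAD", 70)],
    )
    conn.executemany(
        "INSERT INTO my_expression VALUES (?, ?, ?)",
        [("S1", "TP53", 2.0), ("S2", "TP53", 4.0), ("S3", "TP53", 9.0)],
    )
    return conn


def mean_expression_by_tumor_type(conn, gene):
    """Join the user's table to the clinical table and average one gene per tumor type."""
    query = """
        SELECT c.tumor_type, AVG(e.expression) AS mean_expression
        FROM my_expression AS e
        JOIN tcga_clinical AS c ON c.sample_id = e.sample_id
        WHERE e.gene = ?
        GROUP BY c.tumor_type
        ORDER BY c.tumor_type
    """
    return conn.execute(query, (gene,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for tumor_type, mean_expr in mean_expression_by_tumor_type(demo, "TP53"):
        print(tumor_type, mean_expr)
```

On a hosted platform the join itself would look much the same; what changes is that the tables live in the cloud and the scan is spread across many machines rather than run locally.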
Adam Kiezun, Senior Group Leader for Computational Methods Development at the Broad Institute, shared details on the latest major update to the Genome Analysis Toolkit (GATK). This package of tools for finding genetic variants in sequencing data is one of the most widely used collections of software in the field, a foundational part of many analytics pipelines in both academia and the biotech industry. As Kiezun told the audience at Bio-IT World, the Broad Institute employs 15 full-time staffers just to work on GATK, including five devoted entirely to user support and training.
The latest update, GATK4, will include both new features rolled out gradually over the course of the year and a fundamental re-architecting that will be made available around year’s end. GATK4, unlike previous incarnations, will run inside the Apache Spark framework, a cluster computing system in which multiple transformations can be performed on a single dataset in memory, without writing the results of each function back to disk. Because the processing of raw DNA data is highly iterative, cutting out steps of writing and reading files can offer huge efficiencies.
Eventually, Kiezun said, the Broad wants to let GATK4 users go all the way from sequencing reads to variant calls and quality control steps before having to write any results to a file or database. “Even at this early, alpha stage,” he said, “GATK4 based on Spark is really fast. It can do things that were impossible before and makes them easy. Collecting coverage across the whole genome, a 200 gigabyte file—four minutes.” That speed, Kiezun added, can change the very nature of a bioinformatician’s job. Users of GATK4, for instance, might want to calculate the distribution of insert sizes in their data as a routine part of analysis, because it now takes minutes instead of days.
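To make the idea of chained in-memory transformations concrete, here is a minimal sketch in plain Python. It is not Spark and not GATK4 code. It filters read pairs, computes insert sizes, and bins them into a histogram without writing any intermediate file. The ReadPair fields, the quality threshold, and the demo values are hypothetical, and insert size is simplified here to the span from the leftmost to the rightmost aligned base of a pair.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one aligned read pair."""
    name: str
    left_start: int       # leftmost aligned position of the pair
    right_end: int        # rightmost aligned position of the pair (inclusive)
    mapping_quality: int


def keep_confident(pairs: Iterable[ReadPair], min_quality: int = 30) -> Iterator[ReadPair]:
    """Step 1: lazily drop pairs below the mapping-quality threshold."""
    return (p for p in pairs if p.mapping_quality >= min_quality)


def insert_sizes(pairs: Iterable[ReadPair]) -> Iterator[int]:
    """Step 2: lazily turn each pair into its insert size."""
    return (p.right_end - p.left_start + 1 for p in pairs)


def insert_size_histogram(pairs: Iterable[ReadPair], bin_width: int = 100) -> Counter:
    """Step 3: chain the steps and count sizes per bin, all in memory."""
    sizes = insert_sizes(keep_confident(pairs))
    return Counter((size // bin_width) * bin_width for size in sizes)


# Made-up demo data, not real sequencing reads.
DEMO_PAIRS = [
    ReadPair("r1", 1000, 1299, 60),   # insert size 300
    ReadPair("r2", 5000, 5349, 60),   # insert size 350
    ReadPair("r3", 9000, 9419, 10),   # dropped: low mapping quality
    ReadPair("r4", 200, 649, 45),     # insert size 450
]

if __name__ == "__main__":
    for bin_start, count in sorted(insert_size_histogram(DEMO_PAIRS).items()):
        print(f"{bin_start}-{bin_start + 99}: {count}")
```

Each step hands its output directly to the next, so nothing touches the disk until the final histogram is ready. A framework like Spark applies the same principle across a cluster of machines.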
New tools, meanwhile, aim to make GATK4 a much more comprehensive package. Instead of restricting the toolkit mainly to small variants like single-nucleotide polymorphisms, the Broad Institute is developing tools to identify copy number variations and, eventually, large structural variants. They are also adapting successful tools from partners, including a pathogen detection program from the Dana-Farber Cancer Institute. “[It] takes a supposedly human genome, and says, can I find any pathogens in this?” said Kiezun. “Can we find any known ones, and if we don’t know them, can we assemble the genomes?”
Jason Bobe, Director of the Sharing Lab at the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai, gave attendees a sneak peek at the Resilience Project. The project aims to find healthy individuals whose genomes contain mutations that would normally be predicted to cause serious, often life-threatening diseases. In theory, such individuals might have additional, protective mutations that could provide a pathway toward new treatments. The Resilience Project published its first, preliminary results from a retrospective study this Monday.
As Bobe explained, the Resilience Project can call on some remarkable precedents. In the mid-1990s, for example, a man named Stephen Crohn connected with researchers at an AIDS research center in New York, trying to find an explanation for why he had not developed AIDS more than a decade after his partner had died of the disease. In the lab, it was discovered that Crohn’s blood cells were resistant to HIV infection, even at massive doses of the virus. “It was from this one person,” Bobe said, “that we identified a protective genetic mutation in the CCR5 molecule… HIV requires this molecule to bind with the cell surface and integrate and do its damage, and his was broken.” Today, the drug maraviroc, inspired by this observation, is a part of many HIV treatment cocktails.
The Icahn Institute is now in the late stages of planning a massive study, with the goal of recruiting as many as a million individuals, to search for protective variants like Crohn’s in a systematic way. Bobe is particularly interested in how the Resilience Project can attract healthy people to such a diffuse research study, and keep those volunteers engaged as researchers home in on their best leads. “Most people don’t participate [in biomedical research],” he said. “And those who do are those who are managing very serious diseases. We want to turn disease research upside down.”
Bobe has wrestled with this issue before, first as part of the Personal Genome Project and now as Executive Director of PersonalGenomes.org. (We’ve previously spoken with him about the Open Humans network, where volunteers can aggregate and publish data from different studies they’ve participated in.) He advocates for research structures where participants are given a usable interface that is rewarding to work with and teaches them something interesting about their data, even if, as in the Resilience Project, the vast majority will not prove to have the “resilience” variants the researchers are looking for. His team is working with the Apple ResearchKit to give volunteers an easy on-ramp for joining the project through their smartphones, something Mount Sinai has done in the past with asthma patients.
Robert Grossman, a Senior Fellow of the Institute for Genomics and Systems Biology at the University of Chicago, offered a primer on the NCI Genomic Data Commons, a project to harmonize legacy DNA data for directly comparable analyses. Like the Cancer Genomics Clouds, with which it will interoperate, the Commons will be accessible to researchers everywhere through an online portal. At its heart will be three petabytes of genomic data that can be re-processed, from raw sequencing reads, through a shared analysis pipeline.
There is an urgent need for this kind of resource, Grossman said, because studies are increasingly combining large datasets from multiple sources. If these studies start not from raw data, but from processed information like files of genetic variants, they may be drawing faulty comparisons. “A lot of the papers you read in Science and Nature are making inferences from non-harmonized data,” he said. If you then look at where statistically interesting differences arise in that data, “the source of the data, the processing method, may be the most dominant factor.”
Moreover, bioinformatics is a fast-changing field that is constantly improving the accuracy and sensitivity of its tools. “You need the ability to reanalyze [your legacy data] with the most current algorithms for variant detection,” Grossman said.
Multiple, complementary projects to deal with these problems are now underway. The FDA recently launched precisionFDA, a platform where researchers can run their preferred analysis pipelines on a shared set of sequencing reads, to publicly compare results and learn how different tools produce different answers. That project received the top prize last week in the Informatics category at the Bio-IT World Best Practices Awards.
The Genomic Data Commons, however, isn’t trying to judge particular pipelines; it simply aims to make available a large set of data that is directly comparable. “We want a critical mass of data that is in commons and can be shared,” said Grossman. He encouraged other heads of research institutions to build their own commons, whether on their own computing infrastructures or in cloud platforms, with structures in place to interoperate with other groups’ commons. Key features for these commons include the ability to support multiple digital IDs with attached metadata; free access to data between commons, although it may be pay-to-compute; and access through APIs rather than by download.