NIH Launches Data Commons Pilot With 9 Projects

By Bio-IT World Staff

November 7, 2017 | Yesterday the National Institutes of Health announced 12 awards totaling $9 million in Fiscal Year 2017 to launch the NIH Data Commons Pilot Phase.

The NIH Data Commons will be implemented in a four-year pilot phase to explore the feasibility and best practices for making digital objects available through cloud-based collaborative platforms. The goal of the NIH Data Commons Pilot Phase is to accelerate biomedical discoveries by making biomedical research data Findable, Accessible, Interoperable, and Reusable (FAIR) for more researchers.

“The NIH Data Commons Pilot Phase will create new opportunities for research not feasible before,” said NIH Data Commons Pilot Phase Program Manager, Vivien Bonazzi. “Making biomedical data sets accessible and connected at an unprecedented scale will lead to creative new ways to combine, analyze, and ask questions of the data to generate new knowledge.”

Bonazzi has been working on an NIH commons and the NIH Big Data To Knowledge (BD2K) program since 2013, and she acknowledges that the idea of a data commons requires a culture shift within science to reward data sharing. “Organizations will be defined by their digital assets,” Bonazzi said last year at the Converged IT Summit. “The most successful organizations of the future will be those that can leverage their digital assets and transform them into a digital enterprise.”

The NIH Data Commons awards offer a first-step incentive.

The recipients of the 12 awards will form the nucleus of an NIH Data Commons Pilot Phase Consortium in which researchers will start developing the key capabilities needed to make an NIH Data Commons a reality. These key capabilities, identified by NIH, collectively represent the principles, policies, processes, and architectures of a data commons for biomedical research data. They include making data transparent and interoperable, safeguarding patient data, and getting community buy-in for data standards.

Three NIH-funded data sets will serve as test cases for the NIH Data Commons Pilot Phase. The test cases include data sets from the Genotype-Tissue Expression (GTEx) and the Trans-Omics for Precision Medicine (TOPMed) initiatives, as well as the Alliance of Genome Resources, a consortium of Model Organism Databases established in late 2016. These data sets were chosen based on their value to users in the biomedical research community, the diversity of the data they contain, and their coverage of both basic and clinical research. While just three data sets will be used at the outset of the project, the NIH Data Commons effort is expected to expand to include other data resources once the pilot phase has achieved its primary objectives.

FAIR Projects

Nine of the awards went to projects to build an NIH Data Commons, all subscribing to NIH’s comprehensive vision for an interoperable, FAIR-compliant, multi-cloud NIH Data Commons founded on open source and open standards.

University of Maryland NIH Data Commons Facilitation Center: a collaboration of University of Maryland and University of Oxford researchers.

Development and Implementation Plan for Community Supported FAIR Guidelines and Metrics: a collaboration between Maastricht University; Icahn School of Medicine at Mount Sinai; Deloitte Consulting; University of Oxford, Oxford e-Research Centre; and the University of Miami. In the University of Oxford’s announcement of the awards, Avi Ma’ayan, Director of the Centre for Bioinformatics at The Icahn School of Medicine at Mount Sinai, said: “Our work will focus on a development and implementation plan for community supported FAIR guidelines and metrics.”

Patient-centric information commons under FAIR principles (PIC-FAIR): a team of researchers from Harvard Medical School. Several members of the team, including principal investigator Isaac Kohane, already work with PIC-SURE, an NIH Big Data To Knowledge program that is creating a patient-centered information commons. Its focus is aligning all available biomedical data for each individual to enable the large, longitudinal studies essential for understanding causation and addressing key complexities of patient-centric outcomes research.

The Commons Alliance: A partnership to catalyze the creation of an NIH Data Commons: a team of researchers from University of Chicago; University of California, Santa Cruz; and the Broad Institute. Several of the researchers from the Commons Alliance team along with colleagues posted their vision for a Data Biosphere last month.

The Commons Alliance Platform will be designed to handle a heterogeneous mix of data types, including genomics, transcriptomics, and image data, along with associated metadata, the team reported when they announced their award. Ultimately the goal is not one monolithic system to handle all biomedical data, but rather a set of common software modules for creating interoperable systems, which could all reside within a common cloud-based research environment, explained Benedict Paten, assistant professor of biomolecular engineering at UC Santa Cruz and director of the UC Santa Cruz Genomics Institute’s Computational Genomics Lab.

“What we’re building is essential to the future of biomedical science, because it will allow us to ask questions we couldn’t otherwise ask and do things on a scale we never could before,” Paten said. “At the nuts-and-bolts level, it’s a big software engineering project, but its impact will be completely transformational.”

In addition to $917,000 in initial Data Commons Pilot Phase funding for the Commons Alliance, the Commons Alliance partnership was also awarded $5.8 million from the National Heart, Lung, and Blood Institute (NHLBI) for the integration of NHLBI data sets with the NIH Data Commons, including data from the Trans-Omics for Precision Medicine (TOPMed) Program.

A Commons Platform for Promoting Continuous FAIRness: a team of researchers from the University of Chicago and University of Southern California. The project will make use of Globus, a widely used platform for transferring, sharing, and discovering research data developed by University of Chicago and Argonne National Laboratory. Partnering with the University of Southern California’s Information Sciences Institute, the groups will provide cloud-based services including new privacy and security measures for controlled-access data, leveraging tools for managing protected health information. “Globus is used by thousands of researchers in other scientific fields with intensive computational and data needs, and our platform is ready to help support the architecture of the new NIH Data Commons,” said Ian Foster, co-founder and director of Globus and Professor of Computer Science at the University of Chicago, in the University’s announcement of its awards. “We’re excited to bring our mission of accelerating research to this important effort that will unlock new discoveries.”

Tools and workflows for mining genomic data on many clouds: a pair of researchers from University of California, Davis and Curoverse Innovations. Titus Brown, associate professor in the Department of Population Health and Reproduction at the UC Davis School of Veterinary Medicine and at the Genome Center, is the principal investigator of the grant. “My role in this initial six-month consortium is to try to channel the larger goals of the consortium via a series of workshops and the like. Basically I'm trying to be the facilitator that gets everyone interoperating technically (and to some extent socially),” Brown said in a UC Davis announcement of the award.

A collaboration for the NIH Data Commons: a diverse team of researchers from Renaissance Computing Institute (RENCI); RTI International; Maastricht University; Oregon Health & Science University; Lawrence Berkeley National Laboratory; the University of New Mexico Health Sciences Center; and The Jackson Laboratory. In a Jackson Laboratory announcement of the award, JAX said it is piloting software specifically focusing on cardiomyopathy, a common disease of the heart muscle. This new online Disease Navigator will be developed in conjunction with a consortium of model organism databases called the Alliance of Genome Resources (AGR), and will enable scientists who study cardiovascular disease to fast-track their research by accessing relevant genomic and other data from animal models (mouse and rat) cross-referenced to human data.

FAIR data to drive CURES: a team from Seven Bridges Genomics; Repositive; the US Department of Veterans Affairs; and Elsevier. Together the four groups are forming a private-public consortium called FAIR4CURES, according to a Seven Bridges announcement of the award.

“The NIH Data Commons promises to transform the way public biomedical data is stored and analyzed,” Brandi Davis-Dusenbery, CEO, Seven Bridges, said in the press release. The FAIR4CURES team will work within the overall Data Commons pilot to build a full-stack solution that unifies data from a variety of research environments into a single ecosystem that advances data discovery, access, and computation. Seven Bridges will lead the overall project using its existing cloud infrastructure for biomedical data analysis, which includes AWS, Google, and local compute storage solutions, and continue building interoperability standards, such as Common Workflow Language, to accelerate collaborative research and open source development.

CALIFORNIA: Cloud-agnostic Architecture to Locate Indexed FAIR objects and safely reuse them in new integrated analyses: a collaboration between researchers from University of California, San Diego; University of Oxford, Oxford e-Research Centre; and University of Texas Health Science Center at Houston. In the University of Oxford’s announcement of the awards, Lucila Ohno-Machado, an NIH Principal Investigator and UC San Diego Professor of Medicine and Associate Dean for Informatics and Technology, said of the project: “Our award will focus on research ethics, privacy and security and on search and index activities; the latter work will be co-led by Professor [Susanna-Assunta] Sansone, in Oxford, and Professor [Hua] Xu, at UT Health.”

Datasets

Three NIH-funded data sets will serve as test cases for the NIH Data Commons Pilot Phase. The test cases include data sets from the Genotype-Tissue Expression (GTEx) project, curated at the Broad Institute; the Trans-Omics for Precision Medicine (TOPMed) data set at the Universities of Michigan and Washington; and the Alliance of Genome Resources, a consortium of Model Organism Databases established in late 2016, at Stanford University. The stewards for each data set received supplemental funding.