Bio-IT FAIR Data Hackathon 'Pushes The Needle' In Science

Editor's Note: The number of participating teams has been corrected in the first paragraph. The additional teams are represented in paragraphs twenty-eight through thirty-six.

By Benjamin Ross

May 13, 2019 | BOSTON—The Bio-IT World Conference & Expo recently hosted the third annual Bio-IT FAIR Data Hackathon, giving experts in life sciences and IT the opportunity to FAIR-ify a range of existing data sets. Eight teams of researchers spent two days using unique identifiers, linking additional data sets, and collecting appropriate metadata, all the while adhering to the principles of FAIR—Findable, Accessible, Interoperable, Reusable—data.

Ben Busby, a data scientist at the NCBI and a main organizer for this year's event, says hackathons have recently become the hub of innovation, with solutions and ongoing projects getting their starts in these cohesive, creative environments.

"What we're doing in these hackathons is really pushing the needle in science," Busby said during the Hackathon's report out. "People think of hackathons as a fun opportunity to share ideas, and that's great. But we want to produce prototypes that move science forward, and I think that's what we've done over the past few days."

Datasets from Collaborative Drug Discovery, NCBI, the Broad Institute, the Jackson Laboratory, the US Department of Energy (DOE) Joint Genome Institute (JGI), Find Bioscience, and Globus at the University of Chicago were given the FAIR treatment, with results varying from simple tinkering with the readability of a dataset to a revamping of standards that weren't properly enforced.

Report Cards

Find Bioscience's team got the ball rolling, discussing their work generating a fungal index for the Sequence Read Archive (SRA), NCBI's archive of raw sequencing data. It's a wonderful resource, Matthew Blumberg, the team's lead, said during the team's report out. So wonderful, in fact, that it has now become cluttered with data.

"For the hackathon we wanted to improve the function of a specific domain that would allow people to find a particular resource easier," said Blumberg. "We wanted to help people find instances of fungi in that archive without having to do a BLAST [Basic Local Alignment Search Tool] of the entire archive."

The initial version of the application simply gave the raw output of the BLAST. Blumberg and his colleagues wanted to find a way to break up those search results into more manageable chunks. Searching through over 78,000 records within one subset, and then referencing them with 10,400 internal transcribed spacer (ITS) sequences, the team got 8 million hits within the sample index.
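
As a rough illustration of what breaking raw BLAST output into manageable chunks can look like, here is a minimal Python sketch. It is not the team's code. It assumes the hits arrive in BLAST's default tabular layout (-outfmt 6), groups them by the record they matched, and lets a caller attach extra fields through an optional function. The names parse_hits, index_hits, and flag_strong_match, the 97 percent threshold, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Turn tab-separated BLAST lines into dictionaries keyed by column name."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def index_hits(hits, annotate=None):
    """Group hits by matched record; optionally merge in custom fields."""
    index = defaultdict(list)
    for hit in hits:
        if annotate is not None:
            hit = {**hit, **annotate(hit)}  # add caller-defined metadata
        index[hit["sseqid"]].append(hit)
    return dict(index)


def flag_strong_match(hit):
    """Hypothetical custom field: mark hits at or above 97% identity."""
    return {"strong_match": hit["pident"] >= 97.0}


# Made-up sample rows, not real search results.
example_lines = [
    "ITS_0001\tSRR000001.1\t99.1\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450.0",
    "ITS_0002\tSRR000001.1\t91.0\t240\t20\t1\t1\t240\t5\t244\t1e-80\t300.0",
    "ITS_0001\tSRR000002.9\t97.5\t250\t6\t0\t1\t250\t1\t250\t1e-110\t420.0",
]

chunks = index_hits(parse_hits(example_lines), annotate=flag_strong_match)
for record, record_hits in chunks.items():
    print(record, len(record_hits))
```

Grouping by the matched record is only one possible choice. Grouping by the query sequence would instead answer which records contain a given marker.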

"This work is a modular thing," Blumberg said. "We plan to build up more modules, and also allow individuals with special interests to add their own fields to the metadata with a Python function."

The next project, conducted by the NCBI team, took a personalized approach to their dataset, wanting to improve the functionality of pipelines.

"We didn’t focus on making the pipeline FAIR," Tom Madden, the team lead, said. "Instead we wanted to enable people to make the pipelines FAIR and improve them."

Madden and his team have recently been working on a way to enable BLAST on the cloud. The team used the Common Workflow Language (CWL) in the NCBI's bioinformatics pipeline, allowing data to become both more reproducible and interoperable, according to Madden. The team also simplified the pipeline's search capabilities, reducing the steps from three to one.

The machine-readable pipeline, specified by a YAML Ain't Markup Language (YAML) file, can be loaded onto GitHub, Madden explained, allowing users to create an automated diagram of what happened, which would enhance the reusability of the data in question.

"Say you wanted to see if a protein reacted the same way in two different organisms," Madden said. "Typically, you would run two separate BLAST searches, where you would run all the queries of organism A against organism B, and then reverse that into a procedure in a Python script." The hackathon team worked on a way to simplify the procedure.

The team from Collaborative Drug Discovery focused on bioassay protocols and how they could be represented as FAIR in the literature.

"There's a number of initiatives for biomedical and biological investigations that has set aside a number of assay-specific, reporting guidelines for papers," Samantha Jeschonek, the team lead for Collaborative Drug Discovery's hackathon team, said. "These templates are a really good start to getting metadata in place and actually capturing important information about experiments, but where they fall short is in the execution."

Oftentimes these guidelines are just Excel spreadsheets from different sources, Jeschonek says. No standard vocabulary and no associated ontologies, either. On the scale of FAIR, Jeschonek says these guidelines are barely on the radar.

Collaborative Drug Discovery has previously created custom metadata annotation templates for a subset of five minimum information guidelines relating to experimental assay protocols: qPCR, microarray, RNAi, in situ hybridization/immunohistochemistry, and flow cytometry. The hackathon team wanted to evaluate how well the templates enforced the reporting guidelines.

The team found that the papers weren't doing a great job at enforcing guidelines, which Jeschonek says they were anticipating, especially when analyzing work in next generation sequencing. "If you wanted to take a paper and get information about the instrumentation used, how the construction of a lab was set up, as well as the bio sampling, then you had to go through three different links," she said.

In the future, the team hopes to have a common place where all that information can be provided quickly with a predictive-text machine learning algorithm, ensuring the data is captured FAIR-ly.

The next project, which looked at the DOE JGI Genomics dataset, was conducted by a team from the US Department of Energy Joint Genome Institute. As the DOE's flagship sequencing facility, the JGI receives DNA and RNA sequences from researchers in the community on a daily basis. That means a huge repository of publicly available data, Kjiersten Fagnan, JGI member and team lead, says. "We're really invested in trying to make our systems more accessible to the outside world," she said.

The challenge in making this data FAIR is sheer volume, Fagnan says. There are currently 20,000 datasets at JGI, with another 20,000 not made public, each one processed with different pipelines, and the metadata associated with each dataset is highly variable.

The JGI team assessed the FAIR-ness of the dataset's access point, working to link it to other community efforts. The idea, Fagnan says, is to provide a central hub for a researcher to access all these datasets.

"Diverse data mean you're dealing with diverse interpretations of that data," she said. The team chose 12 datasets to see if they could link them to the Environmental Ontology (ENVO). Using ONTOFORCE's linked open data platform, DISQOVER (a 2019 Best of Show winner), the team linked their data to the broader data repositories available.

"The final step when I go back to JGI is actually show the scientific community why it matters to have applied these ontological terms to our metadata," Fagnan said. "Having this search interface is a nice way of showing how your data can link to the broader data repository."

The Broad Institute team worked on the institute's Single-Cell RNA-Seq data set, specifically building a visualization for cancer genomes using single-cell seq data.

"One question asked by a lot of RNA-seq researchers is, 'Where are the gains and losses in the genomes of tumor biopsy samples?'" Vicky Horst, team lead of the Broad Institute, said. "To answer that question, we've been working on visualization for looking at these gains and losses, while also reproducing a study that's on a single-cell portal."

The team focused on each FAIR principle, making sure that the metadata was machine-readable, that their application was retrievable by any authenticated user, that they used modern frameworks, and that their workflow was reproducible on Terra, the Broad's scalable cloud platform for biomedical data.

A second team from the Broad Institute, led by Geraldine Van Der Auwera, approached their dataset by wanting to bring the "power of synthetic sequenced data generation to the masses," according to Van Der Auwera.

"We really focused on the 'R' of FAIR," she said. "When you think about reproducibility, you're taking some data, processing it through some type of code, getting results, and then there's a magic step that happens where you extract the biological findings, that is, you interpret the results."

That interpretation and analysis must be reproducible if it is to be of any use, Van Der Auwera says. If two studies have the same type of data, they should have the same results when run through the same analysis. This is where FAIR comes in.

Van Der Auwera and her team had previously worked on a solution involving the generation of custom synthetic datasets, using an ASHG 2018 project that focused on risk factors for congenital heart disease as a template. Due to privacy protections, they couldn't access the original data, so the team synthesized the exome samples needed to reproduce the study.

At the hackathon, the team wanted to turn this framework into a community resource, streamlining the tools for generating custom synthetic data efficiently.

The best way to become better data stewards is to create better data standards, says Anne Deslattes Mays, team lead of the Jackson Laboratory's dataset work. "Standards matter. Without standards, we can't have things interoperate."

The team wanted to find a way to connect human data with mouse data. Mouse models are still required for drug approval and are valuable ways to understand biology, Mays says. The team approached FAIR in two ways: FAIR for humans, meaning visual interactions with the data, and FAIR for machines, meaning the data itself.

"We want to be able to do more with less people in less time," Mays said. "We want to be able to answer scientific questions and move things forward, even if it is incrementally, which might lead to us being able to do it explosively."

The team focused on the application of Variant Call Format (VCF) file input and output. A tool for annotating and prioritizing exome variants, called Exomiser, was used for the input of metadata, while DISQOVER linked out different data points within the metadata on file.

Globus wanted to enable data transfer so that it could be a natural, seamless part of workflows, Rick Wagner, Globus' hackathon team lead, said. "Once we have that, then we can start doing really cool stuff."

Targeting Galaxy, the team tried to incorporate data movement in and out of resources as part of Galaxy workflows. After selecting a workflow, the team wrote an initial tool that automatically retrieved the remote input datasets as part of the workflow and fed them into a pipeline.

"Instead of making the researcher do all the data wrestling, let's make that one of the steps in the workflow, and take one more piece of the system out of their thought process so they can focus on the science," said Wagner.