FAIR Hacking: Bio-IT World Hosts Its First Hackathon
By Benjamin Ross
June 9, 2017 | The 16th annual Bio-IT World Conference & Expo featured its first Bio-IT Hackathon, a competition focusing on FAIR data, data that are findable, accessible, interoperable, and reusable.
FAIR data and its principles, the subject of a March 2016 comment in Nature and designed by stakeholders in academia, industry, and scholarly publishing, “act as a guideline for those wishing to enhance the reusability of their data holdings.”
The FAIR data principles center on the idea of good data management. According to the authors of the Nature comment, good data management is not a goal in and of itself, but is instead “the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”
“There’s a spectrum of difficult issues that we’ve never made systematic,” said Erik Schultes, FAIR Data Scientific Projects Lead at Dutch Techcentre for Life Sciences, in his lecture opening the Hackathon. “But we’re reaching a point in time where we need to start doing that… We’re starting to feel the effects of data overload and we’re having a hard time reusing data.”
The half-life of supplemental data links is a good example of this. Schultes referred to studies from Nature that said most of those links would disappear altogether after 20 years or so.
Schultes also described his own experience with unusable and lost data. “[My colleagues and I at MIT in 1997] published a paper where we made 100 different RNA constructs,” Schultes recalled. “We tested their activity, I ran hundreds of cells, quantified those, normalized the data, and made it such that I could actually make conclusions on those data. My data stewardship plan at that date had advanced to zip drives.” At the time, holding an “amazing” 100MB of data seemed like a big deal. The only problem is that Schultes doesn’t know where those drives are. “A couple thousand hours of work just vanished.”
This discussion of FAIR data inevitably leads to the topic of data stewardship, a plan that maximizes the use of one’s data.
Beyond the four qualifiers in the acronym (findable, accessible, interoperable, and reusable), there are 15 principles that ensure data is FAIR. These principles range from metadata clearly and explicitly including the identifier of the data it describes, in the “findable” category, to data meeting domain-relevant community standards, in the “reusable” category.
“The acronym [FAIR] has enjoyed unusually rapid uptake,” said Schultes. “But there’s a kind of danger there that, because FAIR becomes so popular, lots of people like to claim FAIRness without really understanding what it is.” FAIR is not a standard, is not equal to “Semantic Web”, is not equal to “Open” or “Free” data, and is not for humans only; the primary intention of FAIR data is to have machines do more of the work for the researcher.
Data and metadata are written to an RDF file, which is then loaded onto a FAIR data point. A FAIR data point is an API layered with metadata that allows a researcher’s data to be located automatically; it is essentially a collection of metadata arranged into a hierarchy. The hierarchy includes languages, keywords, titles, publishers, and so on. That hierarchy is merely the minimum requirement, Schultes said. The metadata are arbitrarily extensible, and the more researchers augment the model with their own parameters, the better off it will be.
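To make the idea of a metadata record concrete, here is a minimal sketch of how a dataset's title, publisher, language, and keywords might be expressed as RDF triples. It is an illustration only, not the FAIR data point software itself: the choice of Dublin Core and DCAT vocabulary terms is an assumption (the talk as reported names no vocabulary), and every example.org address and the dataset title are made up.

```python
# Vocabulary namespaces. Using Dublin Core Terms and DCAT here is an
# assumption made for illustration; the article does not name a vocabulary.
DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"


def literal(value, lang=None):
    """Format a string as an N-Triples literal, with an optional language tag."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def dataset_triples(dataset_iri, title, publisher_iri, language_iri, keywords):
    """Build (subject, predicate, object) triples describing one dataset."""
    subject = f"<{dataset_iri}>"
    triples = [
        # The metadata explicitly includes the identifier of the data it describes.
        (subject, f"<{DCT}identifier>", literal(dataset_iri)),
        (subject, f"<{DCT}title>", literal(title, "en")),
        (subject, f"<{DCT}publisher>", f"<{publisher_iri}>"),
        (subject, f"<{DCT}language>", f"<{language_iri}>"),
    ]
    # One triple per keyword; extending the record is just adding more triples.
    triples += [(subject, f"<{DCAT}keyword>", literal(k, "en")) for k in keywords]
    return triples


def to_ntriples(triples):
    """Serialize triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical dataset; all IRIs below are placeholders.
example = dataset_triples(
    "https://example.org/dataset/rna-constructs",
    "RNA construct activity measurements",
    "https://example.org/org/example-lab",
    "http://id.loc.gov/vocabulary/iso639-1/en",
    ["RNA", "activity"],
)
print(to_ntriples(example))
```

Because each statement is an independent triple, a researcher can add their own parameters by appending further triples without changing the minimum set.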
“The idea for this Hackathon was that people would come with their datasets and they would at first assess the FAIRness of their existing dataset or resource,” Schultes told the attendees of the Bio-IT World Conference & Expo during the Hackathon’s award ceremony. “They would then, within a 24-hour period, figure out how they could improve upon their FAIRness with the resources they had available.”
Three teams completed the Bio-IT Hackathon, focusing on datasets from ClinVar, the Foundation Medicine dataset, and a personal dataset from 23andMe.
The team working on 23andMe’s dataset came in 3rd place. 23andMe’s genomic testing offers the option of sending customers their raw genomic data. From the limited amount of data they had, which included Reference SNP cluster IDs (rsids), chromosomes, and the individual’s genotype, the hackathon team was able to map the data onto a VCF file from the NCBI. The “FAIRify-23andMe” team then used annotation software to whittle the roughly 600,000 SNPs in the raw data down to around 7,000 relevant ones. “[The dataset they were working with] is highly relevant,” remarked Kees van Bochove, CEO at The Hyve and one of the judges of the competition.
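The mapping step described above can be sketched roughly as follows. This is not the team's code. It assumes a tab-separated raw file with rsid, chromosome, position, and genotype columns, and it stands in for the NCBI reference VCF with a small hypothetical lookup table whose rsids, positions, and alleles are invented for illustration.

```python
# Hypothetical reference records keyed by rsid: (chromosome, position, REF, ALT).
# In practice these would be read from a reference VCF; the values are made up.
REFERENCE = {
    "rs0000001": ("1", 1000, "A", "G"),
    "rs0000002": ("2", 2000, "C", "T"),
}

# Invented raw genotype export; the column layout is an assumption.
RAW = """# rsid\tchromosome\tposition\tgenotype
rs0000001\t1\t1000\tAG
rs0000002\t2\t2000\tTT
rs0000003\t3\t3000\tCC
"""


def genotype_to_gt(genotype, ref, alt):
    """Translate allele letters such as 'AG' into a VCF GT value such as '0/1'.

    Returns None when an allele matches neither REF nor ALT (for example a
    no-call), so the caller can skip that row.
    """
    index = {ref: "0", alt: "1"}
    try:
        return "/".join(sorted(index[allele] for allele in genotype))
    except KeyError:
        return None


def to_vcf_lines(raw_text, reference):
    """Join raw genotype rows with reference records and emit VCF data lines."""
    lines = []
    for row in raw_text.splitlines():
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment/header lines
        rsid, chrom, _position, genotype = row.split("\t")
        if rsid not in reference:
            continue  # no reference record, so REF/ALT are unknown
        ref_chrom, pos, ref, alt = reference[rsid]
        gt = genotype_to_gt(genotype, ref, alt)
        if gt is None or ref_chrom != chrom:
            continue
        # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
        lines.append("\t".join(
            [chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt]
        ))
    return lines


for line in to_vcf_lines(RAW, REFERENCE):
    print(line)
```

Rows whose rsid is missing from the reference are dropped rather than guessed at, since the reference and alternate alleles cannot be determined from the raw file alone.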
2nd place went to the team working with the Foundation Medicine dataset. Naming themselves “dmi-pediatrics-oncology,” the team worked with Foundation Medicine’s published dataset of mutated genes in their pediatric cancer samples. The dataset is intriguing in that it allows individuals to visualize and interact with the data. However, the data is only downloadable as an Excel file, which limits its interoperability. Over the course of the competition the team was able to work the dataset into the Discover platform tool, making it easier to access the metadata through an API and allowing researchers to map out the unique identifiers within the cancer samples.
Ultimately, 1st prize went to the team handling ClinVar’s dataset, team “FAIR-Clinvar”. Working with the NIH and NCBI clinical variation database, the team added global identifiers to make the database more findable, in keeping with the FAIR principles. In terms of interoperability, the team attempted to connect the clinical variants, the clinical diseases, and the significance of those diseases into a clearer relationship. “Not only did they work on the interoperability of the dataset by adding identifiers… But they also did a lot of work to model different levels of metadata in the RDF files,” van Bochove said. “As a [judging panel] we were deeply impressed by their work…”
The European Connection
The idea of FAIR data is nothing new, though awareness of the principles in America is not as wide-reaching as it is in Europe. Late last year the European Commission (EC) published the first report of the Commission High Level Expert Group on the European Open Science Cloud (HLEG EOSC). In the report the EC recommends “framing the EOSC as the EU contribution to a future, global Internet of FAIR Data and Services underpinned by open protocols.”
“If you want public money from Europe, you’re going to have to set aside at least 5% of your research budget to data stewardship,” Schultes said. “That’s true now, but it’s going to become compulsory with consequences in the coming years.”
According to Schultes, members of European states that want to be early movers implementing the EOSC have come together to form what they call the GO FAIR Movement, an implementation initiative towards the Internet of FAIR Data and Services.
This movement towards FAIR data is global. The question now becomes how far the movement can go in the coming years.
2017 Bio-IT Hackathon Team Members
TEAM # 1: FAIR-Clinvar
TEAM # 2: dmi-pediatrics-oncology
TEAM # 3: FAIRify-23andme
Russell (Yune) Kunes