Comparing Open Source Research Data Management Tools
By Allison Proffitt
November 8, 2022 | At the Bio-IT World Conference and Expo, Europe, event in Berlin last month, Eelke van der Horst of The Hyve compared various open source research data management tools including The Hyve’s new offering: Fairspace.
The Hyve’s business is to create and provide support for open source solutions, to assist companies in data FAIRification in the life sciences, and to participate in pre-competitive health data projects, explained Van der Horst, a semantic modelling expert and data engineer with The Hyve. Over the years, the research data management space has grown and diversified, so Van der Horst brought together a team of data engineers, DevOps engineers, and more to evaluate the tools available and identify their various strengths and weaknesses.
“We need to see which research data management platforms are out there that we have seen used at clients or that we could adopt. And we also want to know in which scenarios we can apply those best.” The team chose to assess five tools: iRODS, Gen3, Fairspace, CEDAR Workbench, and COLID, all platforms they have seen used in production environments.
The assessment team considered the pros and cons of each tool’s functionality as well as its community. On the community side, they considered code quality, how easy the tool is to install, the size of the tool community, and how users can support or contribute to the tool. For functionality, they considered the security model, how metadata are managed, what data tools are available, whether the tool is compatible with the cloud or can otherwise scale, what types of data analyses are available, and whether it includes a computational notebook.
“We took a month… evaluating, installing, testing each tool,” Van der Horst said. The testing group used real world user requirements and application domains they had seen in their own work.
iRODS—integrated rule-oriented data system—is one of the most established platforms and is maintained by the iRODS Consortium. iRODS is driven by its rule engine, which lets users set up rules to enforce various policies, for example, to archive data according to a schedule. The tool is accessible via command line, though there are graphical user interfaces available too.
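To make the idea of a scheduled archiving policy concrete, here is a minimal Python sketch. It is a toy illustration only and does not use the iRODS rule language or any iRODS API; the `DataObject` class, the `archive_rule` function, and the 365-day threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DataObject:
    """Hypothetical stand-in for a managed file; not an iRODS type."""
    path: str
    last_modified: date
    archived: bool = False


def archive_rule(objects, today, max_age_days=365):
    """Mark objects untouched for more than max_age_days as archived.

    Returns the paths the rule acted on, so a scheduler could log them.
    """
    cutoff = today - timedelta(days=max_age_days)
    acted_on = []
    for obj in objects:
        if not obj.archived and obj.last_modified < cutoff:
            obj.archived = True  # a real system would move the data to archive storage here
            acted_on.append(obj.path)
    return acted_on
```

A scheduler would call `archive_rule` at a fixed interval; the policy itself stays declarative in the sense that only the age threshold needs to change.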
The Hyve team plotted each tool’s strengths on a five-dimension plot, with axes for analysis, metadata, data, computing, and cloud. iRODS’s strength lies in its data-management capabilities, but its handling of metadata is weaker.
Gen3, on the other hand, excels at metadata management. Gen3 was developed at the University of Chicago and is based on a hierarchical graph model. It powers several public data commons including NIH’s BioData Catalyst, NCI’s Cancer Research Data Commons, and more. “Gen3 is ideal because it offers that user interface portal to browse and explore data and it links to actual data in cloud storage buckets,” Van der Horst said. On The Hyve plot, Gen3 covers most areas well, especially exploring and querying metadata, but it doesn’t offer truly interactive browsing or modification of the actual data.
Van der Horst next introduced The Hyve’s own offering, Fairspace, saying that it also offers a secure data portal where researchers can search for metadata and datasets. “Fairspace has the added advantage that it also offers file data manipulation; researchers can really organize their data,” Van der Horst added.
Fairspace scores well on the metadata and computing axes and, like Gen3, has an integrated Jupyter Notebook environment. Its weakest point is analysis. “You can search, but you can’t graph or get charts of the available data in your system,” he said. The data model for Fairspace is based on SHACL; metadata is stored in a graph store. The model is fairly flexible, but Van der Horst advised: “You should have someone that’s familiar with SHACL to build these metadata models.”
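For readers unfamiliar with SHACL, the sketch below mimics in plain Python the kind of constraint a SHACL shape expresses: which properties a record must have, how many values, and of what type. It is not SHACL and not Fairspace's actual model; the `PropertyConstraint` class, the `SAMPLE_SHAPE` definition, and the field names are hypothetical, with comments noting the real SHACL terms (`sh:path`, `sh:datatype`, `sh:minCount`) they loosely correspond to.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyConstraint:
    path: str           # metadata field name (loosely analogous to sh:path)
    datatype: type      # expected Python type (loosely analogous to sh:datatype)
    min_count: int = 0  # minimum number of values (loosely analogous to sh:minCount)


# Hypothetical shape for a "Sample" record; the field names are illustrative only.
SAMPLE_SHAPE = [
    PropertyConstraint("sample_id", str, min_count=1),
    PropertyConstraint("collection_year", int, min_count=1),
    PropertyConstraint("tissue", str),
]


def validate(record: dict[str, list], shape: list[PropertyConstraint]) -> list[str]:
    """Return a list of human-readable violations; an empty list means the record conforms."""
    violations = []
    for constraint in shape:
        values = record.get(constraint.path, [])
        if len(values) < constraint.min_count:
            violations.append(
                f"{constraint.path}: expected at least {constraint.min_count} "
                f"value(s), found {len(values)}"
            )
        for value in values:
            if not isinstance(value, constraint.datatype):
                violations.append(
                    f"{constraint.path}: {value!r} is not of type "
                    f"{constraint.datatype.__name__}"
                )
    return violations
```

In a real deployment the shape would be written in RDF and checked by a SHACL processor; the point here is only that the model is a set of declarative constraints that someone has to design.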
COLID—Corporate Linked Data—was developed by Bayer and then released as open source. “We were really charmed by it,” Van der Horst said. The functionality is simple: a search tool lets users seek data in the COLID data marketplace. There is no data management capability but the tool excels at maintaining FAIR data. “It has a narrow application, but it does its job well,” Van der Horst commented.
Built by the Center for Expanded Data Annotation and Retrieval at Leland Stanford Junior University, CEDAR Workbench is used to create submission templates for metadata. “All of the previous tools have a metadata model defined for you by semantic modelling experts, for instance,” Van der Horst said. “This tool can really be used to sit with your scientist and define a minimal data model of metadata that should be submitted.” There can be multiple versions, he added, and the data that are stored and produced are in RDF. As expected, CEDAR Workbench scores high on the metadata axis as well as the cloud axis.
Comparing and Contrasting
When comparing the plots of all five tools, Van der Horst noted that some overlapped considerably while others were complementary to each other. For example, COLID and CEDAR are similar in that both offer good metadata management, while iRODS offers the best data capabilities.
The documentation and community aspects of each tool varied as well. iRODS and Gen3 both had very mature documentation with thorough user guides, forums, developer documentation, and more. Fairspace and COLID both had significant gaps in their documentation, and CEDAR fell in the middle. iRODS, and to a slightly lesser extent Gen3, have lively and well-developed user communities, something that both Fairspace and COLID still need to build.
On the other hand, only iRODS and Fairspace installations rated “easy” both locally and in the cloud. Gen3’s installation in both areas was rated “hard” and COLID had some issues installing locally.
In summary, Van der Horst highlighted that each tool has its own best application scenario and all have both unique features and overlap. All are production ready, he said, so it’s good to think through your specific needs and how one or maybe two tools could best meet them.
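As a rough illustration of matching needs to one or two tools, the Python sketch below encodes only the strengths named explicitly in this article and searches for the smallest combination that covers a given set of requirements. The `STRENGTHS` table is a coarse paraphrase of the talk as reported here, not The Hyve's actual scores, and the `shortlist` function is a hypothetical helper.

```python
from itertools import combinations

# Strengths as described in this article only; a coarse paraphrase, not The Hyve's scores.
STRENGTHS = {
    "iRODS": {"data"},
    "Gen3": {"metadata"},
    "Fairspace": {"metadata", "computing"},
    "COLID": {"metadata"},
    "CEDAR Workbench": {"metadata", "cloud"},
}


def shortlist(needs, max_tools=2):
    """Return the smallest tool combinations whose combined strengths cover all needs."""
    needs = set(needs)
    for size in range(1, max_tools + 1):
        hits = [
            combo
            for combo in combinations(STRENGTHS, size)
            if needs <= set().union(*(STRENGTHS[tool] for tool in combo))
        ]
        if hits:
            return hits  # stop at the smallest size that works
    return []
```

For example, `shortlist({"data", "computing"})` pairs iRODS with Fairspace, while a requirement for strong analysis returns an empty list because the article names no tool as strong on that axis.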