iRODS Consortium Carries Open Source Data Management Software Forward
By Paul Nicolaus
February 15, 2018 | Integrated Rule-Oriented Data System (iRODS) is used across the globe in industries ranging from the life and physical sciences to media and entertainment, but the software’s origins can be traced back over two decades to a team at the San Diego Supercomputer Center (SDSC) and a project known as the Storage Resource Broker (SRB).
As large clusters generated massive amounts of data from physics simulations, the challenge was to capture and organize all of that information in a way that was useful to the scientists, explained Jason Coposky. Together, General Atomics, the Data Intensive Computing Environments (DICE) group, and the SDSC at the University of California developed a solution that went on to become a commercial product known as Nirvana.
But the researchers wanted to continue their work on data management technology, so the DICE group started over again. This time around, they went open source to be clear of intellectual property ownership and other restrictions and created iRODS in 2006. With some of DICE’s members located at the University of North Carolina at Chapel Hill, the Renaissance Computing Institute (RENCI) became more heavily involved and created a software engineering team to continue its development.
Eventually, though, the research was completed, the funding for building a data management solution dried up, and there remained this piece of software that had become an important tool for some of its users. At that point, there was a need to come up with a sustainability model for something that was essentially free to use, so in 2013 RENCI founded the iRODS Consortium. The idea is that the users for whom the software has become critical can come to the table with not only dollars but also expertise on how it has been deployed and what is needed for improvement.
As executive director of the iRODS Consortium, Coposky leads a membership made up of businesses, research organizations, universities, and government agencies. Wellcome Sanger Institute uses it to manage tens of petabytes of genomics data, for example, and Bayer's Crop Science division uses it across its domains. There are also high-performance computing (HPC) organizations that use it to manage the information coming off their systems.
Data Centric, Metadata Driven
Within computing infrastructure, data typically flows somewhat haphazardly from one spot to another, Coposky explained, and iRODS provides the ability to wrap the infrastructure around the data. When information flows through the data management lifecycle, a story is told. Users get a role, and collections of data get an identity. "We can make all of that infrastructure and data actionable by attaching metadata to it," he said.
From a user-facing perspective, the flexibility and "future-proofness" of the software are what make iRODS especially compelling. "Using one of our core competencies, the data virtualization, you immediately insulate yourself from all of your storage," he explained, which provides the ability to add new storage, decommission old storage, and replicate data automatically across different storage systems.
As researchers perform analysis and want to collaborate with one another, the data management requirements shift and change. Because of this, Coposky said, it's important to have a system that can flex and grow with the data as it moves through its life cycle, and iRODS gives implementers of the system the ability to meet those requirements.
Since its inception, a number of companies and organizations have developed products that work with or on top of iRODS. EMC Metalnx, for example, is an open-source web application developed for IT administrators, data engineers, and research investigators to capture, manage, and apply metadata to research data collections.
And according to Coposky, the European project dubbed EUDAT uses iRODS extensively to manage research data across many universities throughout Europe. At this point, he says, the software is more popular overseas than in the United States since the requirements for data management and privacy are considerably stricter in Europe.
He pointed to University College London as a typical use case. In the UK, there is a requirement that data must be accessible for 10 years after the last time of use. "That is a very difficult problem to solve at scale," he said, but iRODS provides that capability because it is possible to keep track of the information and who has accessed it.
It is also possible to generate reports for a data retention policy. The idea is that if no one has accessed the data or referenced it in a paper in the last 10 years, the information can be moved from near and fast storage to a different storage area, at which point it goes cold and eventually becomes part of a report indicating that it may be deleted, if desired.
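As a rough illustration of the retention idea described above, the following Python sketch demotes data that has gone untouched for the full retention window and lists it in a report of deletion candidates. It is not iRODS code: the `DataObject` class, the tier labels, the example paths, and the approximation of 10 years as 3,650 days are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "10 years" is approximated as 3650 days for this sketch.
RETENTION_WINDOW = timedelta(days=3650)


@dataclass
class DataObject:
    """Hypothetical record of a file and when it was last used."""
    path: str
    last_accessed: datetime
    tier: str = "fast"  # hypothetical tier labels: "fast" or "cold"


def apply_retention_policy(objects, now):
    """Move stale objects to the cold tier and return their paths as a report."""
    deletion_candidates = []
    for obj in objects:
        if now - obj.last_accessed >= RETENTION_WINDOW:
            obj.tier = "cold"
            deletion_candidates.append(obj.path)
    return deletion_candidates


# Hypothetical example data.
now = datetime(2018, 2, 15)
objects = [
    DataObject("/zone/home/lab/run_2005.dat", datetime(2006, 1, 10)),
    DataObject("/zone/home/lab/run_2017.dat", datetime(2017, 11, 3)),
]
report = apply_retention_policy(objects, now)
```

Here only the file last used in 2006 is moved to the cold tier and added to the report. A real deployment would also need to account for references in publications, which this sketch ignores.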
Room for Improvement
"I find iRODS powerful in the sense that it virtualizes your data," says John Jacquay, scientific systems engineer at BioTeam, a life science informatics and bio-IT consulting services company that helps its clients solve data management issues. "It really provides unified namespace for all of your data, and it provides a lot of abstractions and features on top of that."
It allows for data sharing between organizations and supports access control lists and permissions. It is also possible to replicate data, create geographically distinct backups, and create automated workflows. "So iRODS is really a fantastic framework to basically intelligently manage large datasets that aren't confined to single systems," he explained.
Because iRODS is extendable and modular, developers can use the software to implement virtually any functionality imaginable, but it is also extremely complex. At its core, the software runs a lot of different code, and it can be a challenge to understand what's going on within it. And as middleware, it is not a fully baked solution. "You can't just install iRODS for a client and expect them to be able to utilize all the features and functionality it provides," he said, "so it really requires a lot of additional development to make it useful."
The software is mainly designed around file management, added Simon Twigger, senior scientific consultant at BioTeam. With a background in lab science and metadata, he looks at the tool from an end user’s perspective. And a scientist often wants to use iRODS as an exploration tool, not just a file browser.
To make that happen, the metadata becomes critical because of the many standards scientists need to adhere to, particularly in the biological sciences. The software has some nice features for extracting metadata from files and capturing it automatically, he says, but its support for other styles of metadata has historically been "a little primitive."
To some extent, he said, "its flexibility is also part of its curse." Developers building tools on top of iRODS have to pay close attention to managing that metadata because with this type of system it is "garbage in, garbage out." Unless there is an effort to ensure that people put in good system metadata or manage it effectively, the ability to query it again is limited.
"There have been some developments over the last year or two to implement metadata templates," Twigger said, "which are, I think, a nice advance in this particular area." This would allow users to define specific metadata fields that are required to be present for a file to be uploaded or put into a particular collection. But as far as he is aware, this element has yet to become a reality.
Although there are other browsers and other tools that sit on top of iRODS, a gap still exists between the core technical components, which he views as "really nice and incredibly useful," and a user-friendly front end that could be leveraged by less technical users. Finding better ways to tackle the front-end side of things "to leverage the power that iRODS provides under the hood" is an area that he said needs to be expanded upon moving forward.
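The metadata templates Twigger describes, in which certain fields must be present before a file can enter a collection, might look something like the following Python sketch. It only illustrates the idea and does not reflect any actual iRODS feature or API; the template contents and function names are hypothetical.

```python
# Hypothetical template: attribute names required for a given collection.
GENOMICS_TEMPLATE = {"sample_id", "organism", "instrument"}


def missing_fields(template, metadata):
    """Return the required attributes that are absent or empty, sorted by name."""
    return sorted(name for name in template if not metadata.get(name))


def can_upload(template, metadata):
    """Allow an upload only when every required attribute is filled in."""
    return not missing_fields(template, metadata)


# Hypothetical metadata supplied with a file.
metadata = {"sample_id": "S-001", "organism": "Homo sapiens"}
problems = missing_fields(GENOMICS_TEMPLATE, metadata)
```

In this example the upload would be refused because the `instrument` attribute is missing. Rejecting incomplete records at the point of entry is one way to address the "garbage in, garbage out" concern raised above.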
What's New and What's Next
An audit plugin has been built to shift all of the interesting events that happen within the system out over an Advanced Message Queuing Protocol (AMQP) message bus. All of those events can be streamed into Elastic Stack (a stack of technology used to gather logs and analyze the content) for dashboarding and system health. But the plugin also offers full provenance of the data, which can be queried. "This becomes very interesting from a reproducible science point of view," Coposky added.
Any data that gets streamed or placed into iRODS can, using this same audit plugin, trigger indexing. "Beyond that," Coposky said, "we now have the ability to shift different features as plugins rather than as part of the core." This means the core will remain stable and all of the new features can move at their appropriate development pace rather than keeping up with the development of the core itself. Looking ahead, work continues on the upcoming 4.3 release, which represents a hardening of the existing core and a new phase of development.
Recent years have been devoted to addressing the many issues that came with an inherited, older code base written in C, BioTeam's Jacquay explained. Now there's an effort to move from maintenance and upkeep to new feature implementation, and as he considers the improvements still to come there are several he's particularly excited about.
These include integration with parallel distributed file systems, such as IBM General Parallel File System (GPFS), Lustre, or EMC Isilon. Instead of letting iRODS control the data and where it's stored, the software would instead act as a kind of registry. This is a good way to expose these large file systems inside of an iRODS namespace, he explained, but it also brings about the problem of synchronization.
In the current system, this would be addressed through polling to look for differences between the data stores and figure out which files had changed, which had been deleted, and which had been added. The consortium is working on a plugin to allow push notifications to the iRODS iCAT database rather than using this type of polling. "Every time a file is changed externally on this file system, iRODS is notified and the database is updated," Jacquay said. "This will bring about massive improvements in performance, and it will really allow iRODS to scale."
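The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration, not the consortium's plugin: the catalog and the file system are modeled here as plain dictionaries mapping a path to a modification time, and the event format is hypothetical.

```python
def diff_snapshots(catalog, filesystem):
    """Polling: compare the whole catalog against a full file system scan."""
    added = sorted(filesystem.keys() - catalog.keys())
    deleted = sorted(catalog.keys() - filesystem.keys())
    changed = sorted(
        path
        for path in catalog.keys() & filesystem.keys()
        if catalog[path] != filesystem[path]
    )
    return added, deleted, changed


def apply_event(catalog, event):
    """Push: update only the catalog entry named by a single notification."""
    kind, path, mtime = event  # hypothetical event tuple
    if kind == "deleted":
        catalog.pop(path, None)
    else:  # "added" or "changed"
        catalog[path] = mtime


# Hypothetical state: path -> modification time.
catalog = {"/a": 1, "/b": 2, "/c": 3}
filesystem = {"/a": 1, "/b": 5, "/d": 4}
added, deleted, changed = diff_snapshots(catalog, filesystem)
```

Polling has to walk every entry on both sides each time it runs, so its cost grows with the size of the file system. A push notification touches only the entry that changed, which is consistent with the performance gains Jacquay describes.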
A multipart data object feature would make it possible to split large files and data objects into chunks and distribute them across resource servers, and there are intentions to provide an easy way to install different plugins rather than having to grab the source code and build them. Another interesting development noted on the iRODS Roadmap, he said, is the next-generation networking API.
Currently, a custom binary network protocol is used, so if you have an iRODS server on your network each of the clients that want to talk to iRODS has to understand how to speak that language. Now, though, there is a move toward using a single serialization and messaging framework that provides a common language, he said, which “opens up the floodgates for integration.”
Because iRODS is an older technology, Jacquay acknowledged that it can be difficult to compete with newer pieces of software and technology coming up, such as Starfish and Mediaflux, but he predicts that all of this planned development will “invigorate the community and provide some well-needed refreshment.”
Paul Nicolaus is a freelance writer specializing in science, technology, and health. Learn more at www.nicolauswriting.com.