The DICE Center’s open-source data management tool could be nirvana for NGS users.
By Kevin Davies
February 1, 2011 | Charles Schmitt, who directs biomedical and health informatics at RENCI (Renaissance Computing Institute), describes a common dilemma at the University of North Carolina (UNC), which has been investing in next-generation sequencing for medical and basic biology research.
“They’ve been building pipelines for RNA-Seq, ChIP-Seq, exomic and full sequence analysis, but people are still discovering how to deal with the data,” says Schmitt. “How do we build databases to query variants on thousands of subjects? How do we annotate tens to hundreds of terabytes of data? How do we organize the data for downstream analysis?”
For Schmitt, who is particularly interested in applying computer science to enable high-throughput research efforts, the potential solution for his data movement/management woes turned out to be right on his doorstep. That resource, called iRODS (Integrated Rule Oriented Data System), is attracting a lot of interest from many other stakeholders in the next-gen sequencing (NGS) field.
“iRODS is very good,” says consultant Chris Smith, co-founder of Distributed Bio (see “Follow the Rules”). “The systems using it now are such large scale, with multiple sites, that it’s the only kind of system that really works.”
The principal investigator behind iRODS is Reagan Moore, who is also RENCI’s chief scientist for the Data Intensive Cyber Environments (DICE) Center, set up at UNC last year. “Reagan Moore promoted the notion of metadata-driven file management,” says Smith. “These systems came out when grids were still fashionable.”
“We have to handle the fact that data are sitting in different domains, on different devices, under the control of different groups,” says Moore. “For the system to be viable, we’ve had to employ multiple levels of virtualization.” The iRODS open-source software manages the properties of data registered into sharable collections, independent of location. It manages descriptive metadata about each item, including access controls, authorization, and integrity checks, and enforces management policies for basically any desired property.
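To make the idea of location-independent collections more concrete, here is a minimal Python sketch of a metadata catalog that maps one logical name to several physical replicas, with descriptive metadata and a simple read-access list. It is a toy model for illustration only, not the iRODS catalog; every class, method, path, and storage name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One catalog entry: a logical name plus everything known about it."""
    logical_path: str
    replicas: list = field(default_factory=list)   # physical locations
    metadata: dict = field(default_factory=dict)   # descriptive attributes
    readers: set = field(default_factory=set)      # users allowed to read


class Catalog:
    """Toy metadata catalog keyed by logical path (hypothetical design)."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_location, **metadata):
        # Registering the same logical path again adds another replica.
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas.append(physical_location)
        obj.metadata.update(metadata)
        return obj

    def grant_read(self, logical_path, user):
        self._objects[logical_path].readers.add(user)

    def locate(self, logical_path, user):
        # Access control is checked before any physical location is revealed.
        obj = self._objects[logical_path]
        if user not in obj.readers:
            raise PermissionError(f"{user} may not read {logical_path}")
        return list(obj.replicas)

    def query(self, **wanted):
        # Find logical paths whose metadata matches every requested attribute.
        return sorted(
            path for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in wanted.items())
        )


catalog = Catalog()
catalog.register("/lab/run42/sample1.bam", "unc-disk:/vol3/a1.bam", assay="RNA-Seq")
catalog.register("/lab/run42/sample1.bam", "s3://example-bucket/a1.bam")
catalog.register("/lab/run43/sample7.bam", "renci-tape:/t9/b7.bam", assay="ChIP-Seq")
catalog.grant_read("/lab/run42/sample1.bam", "alice")

print(catalog.query(assay="RNA-Seq"))
print(catalog.locate("/lab/run42/sample1.bam", "alice"))
```

The point of the sketch is that users ask for a logical path or a metadata value and never need to know which device or domain holds the bytes.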
Moore credits the iRODS architectural team, led by Mike Wan, which wrote the software infrastructure that enables iRODS to organize distributed data into a shareable collection. Arcot Rajasekar developed a distributed rule engine for enforcing management policies, and Wayne Schroeder implemented security mechanisms, the metadata catalogue, and productization procedures.
iRODS is second-generation technology, succeeding a data grid known as the Storage Resource Broker (SRB), which started in 1996 at the San Diego Supercomputer Center. The last update to SRB was in 2005; since then, all efforts have been focused on iRODS.
Most of the original users of iRODS were groups, particularly in astronomy and oceanography, that needed to share and distribute petabytes of data around the world: the National Optical Astronomy Observatory, which sends data from Chile to American archives; science domains using National Science Foundation (NSF) supercomputer centers; and so on. The French National Library uses iRODS, as do the Australian Research Collaboration Service, the Carolina Digital Repository, and the iPlant Collaborative.
When Moore’s team originally wrote SRB, consistency policies were hard-coded in the software for security purposes. But some user communities, such as the UK’s e-Science DataGrid, wanted different policies for different parts of their data collection. “Some groups might not allow data deletion, others might want read/write capability, and so on,” says Moore. “We recognized that we had to abstract the policies so they could be managed independently.”
iRODS can manage the properties of data collections that reside anywhere, including the Cloud. “We’ve written drivers that speak Amazon S3 protocols, so I can build a collection that spans Cloud storage, and any data source anywhere in the world,” says Moore. “We can then assign properties we want to that collection… We’ve not found any storage system we haven’t been able to interface to.”
In terms of deploying iRODS, Moore stresses that each community needs to have a good idea of the policies it wants to enforce for extracting metadata, replicating data, validating assessment criteria, and so on. “We provide basic functions to compose these procedures from micro-services, written in C, that run remotely,” explains Moore. “When we distribute software, we build a distributed rule engine at each storage site. Data can’t leave your storage system unless it complies with your rules.” No matter whether the user is interested in sharing, or publishing in a digital library, or building preservation environments, “all require the same capabilities; they just change their policies and procedures.”
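A rough way to picture policies composed from small procedures is the Python sketch below. The plain functions stand in for micro-services (which Moore says are written in C), and the `RuleEngine` class is a hypothetical toy, not the iRODS rule engine or its rule language. The event name, the object fields, and the two checks are assumptions made up for the example.

```python
class PolicyViolation(Exception):
    """Raised when a requested action breaks a site's rules."""


# Stand-ins for micro-services: small functions that each check one thing.
def require_checksum(obj):
    if not obj.get("checksum"):
        raise PolicyViolation("object has no checksum on record")


def require_consent(obj):
    if obj.get("consent") != "public":
        raise PolicyViolation("consent does not allow release")


class RuleEngine:
    """Toy per-site rule engine: maps an event name to a chain of checks."""

    def __init__(self):
        self._rules = {}

    def add_rule(self, event, *services):
        self._rules.setdefault(event, []).extend(services)

    def enforce(self, event, obj):
        for service in self._rules.get(event, []):
            service(obj)   # any service may veto the action by raising
        return True


# Each storage site would hold its own engine and its own policies.
site = RuleEngine()
site.add_rule("export", require_checksum, require_consent)

ok = {"name": "a.bam", "checksum": "9f2c", "consent": "public"}
blocked = {"name": "b.bam", "checksum": "77aa", "consent": "restricted"}

print(site.enforce("export", ok))
try:
    site.enforce("export", blocked)
except PolicyViolation as err:
    print("blocked:", err)
```

Changing what a site allows means changing the list passed to `add_rule`, not the engine itself, which is the sense in which different communities can share the same capabilities while changing only their policies and procedures.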
One of the biggest NGS projects at UNC, directed by Kirk Wilhelmsen, involves deep sequencing in several thousand patients with addiction problems. “Most of our computations across aligned sequence data (BAM files) will be driven through the Hadoop framework. All the called variants will be stored in a PostgreSQL relational database and linked to internal and external annotations,” says Schmitt, who hopes that iRODS can manage the hundreds of terabytes of NGS data coming in and out of the system while leveraging the available computational resources at UNC and RENCI. “We have a distributed data problem, especially as data moves out of production pipelines and into downstream analysis. We have different compute clusters tied to different systems.”
Schmitt describes the primary advantage of iRODS as follows: “Before iRODS, we ended up with a broad assortment of batch files and scripts and various tools for moving data around, e.g. Robocopy. This works fine for small things, but it’s very brittle as you scale up and programmers move on. Keeping track of everything becomes a nightmare. Data goes through multiple stages of processing, which means they’re sitting on multiple clusters of servers [which is not] a good framework for automating movement between systems.”
Another early proponent of iRODS is Guy Coates at the Wellcome Trust Sanger Institute, a member of the informatics team grappling with the data inflation problem posed by next-gen sequencing (see “iRODS: Wellcome Relief”).
“We’re heavily hoping iRODS is the right framework,” Schmitt continues. “At this point, we’ve done enough high-level review that we feel any issues can be resolved. Architecturally, we think it’s fine.” The current iRODS grid that has been set up for sequencing is “a fairly lightweight deployment,” but plans have been laid out to expand it significantly.
A proposed “BioRODS” would have an API layer on top oriented to key tasks. Schmitt and Leesa Brieger, project manager for the iRODS@RENCI development team, have conferred with counterparts at the Broad Institute, Sanger Institute, and The Genome Center at Washington University to come up with a set of API requirements. The primary needs are:
1) Improve and automate ingestion of metadata. Groups need better ways to manage metadata after a PI or grad student leaves.
2) Enable processing of sequence data based on metadata, such as discovery of duplicate files and automated archiving, deletion, and publication of data, by walking through the data.
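As an illustration of the second requirement, the Python sketch below walks a directory tree and groups files with identical content by their SHA-256 digest. It is only a stand-in for the idea: a real deployment would presumably compare checksums already stored as metadata rather than re-reading every file, and the function names and sample file names here are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Walk `root` and group files by content hash; return groups of 2+."""
    by_digest = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "student_copy"))
    samples = [
        ("run1.bam", b"ACGT"),
        (os.path.join("student_copy", "run1.bam"), b"ACGT"),
        ("run2.bam", b"TTGA"),
    ]
    for rel, payload in samples:
        with open(os.path.join(root, rel), "wb") as handle:
            handle.write(payload)
    groups = find_duplicates(root)
    duplicate_names = [[os.path.relpath(p, root) for p in g] for g in groups]

print(duplicate_names)
```

Once duplicates are grouped, an archiving or deletion policy can decide which copy to keep.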
One life sciences project that has been using iRODS for a few years now is the Motif Network, a collaboration between RENCI and the National Center for Supercomputing Applications. The project runs large Taverna workflows to discover domain architectures within gene and protein sequences, and iRODS is deployed behind those workflows to manage data movement.
“iRODS isn’t that heavy,” says Schmitt. “You can download the server and get it up and running pretty quickly. It’s not very resource intensive on the computer or IT-wise. The biggest thing that needs to trickle down are some more user-friendly interfaces that allow non-IT people to leverage the power of iRODS rule-based mechanisms.”
Moore’s group is addressing this by developing an iDrop interface akin to a Dropbox for iRODS. “This would let users seamlessly use iRODS,” explains Schmitt. “Dropbox goes to the Amazon cloud; iRODS would go to the department grid. It’s a much more appropriate dropbox for academics. It’s really exciting!”
As for sharing data, Schmitt’s group has so far used iRODS only to share data between servers and desktops. “It’s our way of having a shared file system. We need to better manage metadata to do automated processes and computational workflows.” That work will begin in early 2011.
Moore says that one of his mandates under NSF funding is to enable use of the technology. Technical questions can be answered by the iRODS community online or by a member of Moore’s team.
Follow the Rules
Chris Smith spent 13 years at Platform Computing, in large part working with Platform LSF. Last fall, he and ex-Pfizer bioinformatician Giles Day set up their own bio-IT consultancy, Distributed Bio. “We provide a multidimensional view of things,” says Smith. “I understand the technology for grids, clouds and clusters.”
Smith’s interest in iRODS as a flexible solution for handling large volumes of data dates back to his days scheduling jobs at Platform, where data location became an important consideration. “A pet interest of mine is to figure out how to make a scheduler more intelligent—where it places jobs with respect to data? You need a metadata service for the scheduler to use. Where is this file located? Where are the replicas?”
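Smith's question about making a scheduler data-aware can be sketched in a few lines of Python. The function below is a hypothetical illustration, not part of Platform LSF or iRODS: it assumes the scheduler can ask a metadata service for a map of file replicas and then prefers the site that already holds the most input files.

```python
def place_job(input_files, replica_map, free_slots):
    """Pick the site holding the most input files that still has a free slot.

    replica_map: file name -> set of sites holding a replica
    free_slots:  site name -> number of idle job slots
    Returns (site, files_to_copy), or None if no site has a free slot.
    """
    candidates = [site for site, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None

    def local_count(site):
        return sum(1 for f in input_files if site in replica_map.get(f, set()))

    # Most local files wins; alphabetical order breaks ties deterministically.
    best = max(sorted(candidates), key=local_count)
    missing = [f for f in input_files if best not in replica_map.get(f, set())]
    return best, missing


replica_map = {
    "s1.bam": {"unc", "renci"},
    "s2.bam": {"renci"},
    "ref.fa": {"unc", "renci", "cloud"},
}
job_inputs = ["s1.bam", "s2.bam", "ref.fa"]

placement = place_job(job_inputs, replica_map, {"unc": 4, "renci": 2, "cloud": 0})
fallback = place_job(job_inputs, replica_map, {"unc": 4, "renci": 0, "cloud": 0})
print(placement)
print(fallback)
```

In the second call the preferred site is full, so the job lands elsewhere and the function reports which file would have to be moved first.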
iRODS allows users to make these decisions about where to locate and replicate a file. “As I saw Platform’s customers struggling with data, located across gazillions of file heads, or a massive Isilon or Panasas cluster, I thought of my digital camera problem. That layer of manageability was missing. iRODS is a key piece that helps manageability.”
Smith sees two key attributes to iRODS. First is the ability “to annotate data with metadata over and above the usual Unix time-based stuff. This is the part that will be useful to end users.”
Second, the rules engine capability makes data management very powerful. “You can execute a number of rules on how you manage data over time. For example, expiry: I have data living on very expensive tier 1 storage. At a certain time, I migrate to tier 2. You can do very complex things: automatic replication to disaster recovery sites, checksumming, combat bit rot a little.”
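The expiry and checksumming ideas Smith describes can be sketched as two small Python functions. This is a minimal illustration under assumed names and fields, not iRODS rule syntax; the 90-day threshold is an arbitrary choice for the example.

```python
import hashlib
from datetime import datetime, timedelta


def plan_migrations(objects, now, max_age=timedelta(days=90)):
    """Return names of tier-1 objects whose last access is older than max_age."""
    return [
        obj["name"]
        for obj in objects
        if obj["tier"] == 1 and now - obj["last_access"] > max_age
    ]


def is_intact(data, recorded_checksum):
    """Detect bit rot by comparing a fresh SHA-256 with the recorded one."""
    return hashlib.sha256(data).hexdigest() == recorded_checksum


now = datetime(2011, 2, 1)
inventory = [
    {"name": "old.bam", "tier": 1, "last_access": datetime(2010, 9, 1)},
    {"name": "new.bam", "tier": 1, "last_access": datetime(2011, 1, 20)},
    {"name": "archived.bam", "tier": 2, "last_access": datetime(2010, 1, 5)},
]
to_move = plan_migrations(inventory, now)
print(to_move)

recorded = hashlib.sha256(b"ACGT").hexdigest()
print(is_intact(b"ACGT", recorded), is_intact(b"ACGA", recorded))
```

A rule engine would run checks like these on a schedule and then trigger the actual move or repair.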
For the Broad Institute and other genome centers, Smith says large datasets that are three months old can be moved. “Is this dataset part of a collection that’s meaningful? Let’s put a project time stamp on it.”
iRODS is open source and “as free as you can find someone to install it,” says Smith. “You don’t have to worry about licensing costs. There’s a very active mailing list for support. I think they have a vision of providing a little more commercial support.” Smith says he found iRODS “incredibly easy” to install—barely 30 minutes to compile and install a database and ODBC (Open Database Connectivity).
Getting the most out of iRODS requires users to understand their goals, the archiving rules, the style of deployment, and how to collect their data centers into a zone. “I think anybody could benefit,” says Smith. “It provides something the file system is missing. Managing the data deluge on a large workstation otherwise requires Windows Index Search or Mac OS X Spotlight or Search. It’s not well directed, you get a lot of false positives. It requires you to sift through the data. The metadata approach is much more structured.”
Smith wonders if the software bundled with NGS machines might be made iRODS compatible. “If these mechanisms of annotating can be automatic, that reduces the burden on end users and it provides value. You avoid structured naming and the accidents around that. We need to move away from that, to a point where file structure is meaningful to end users.”
Overall, Smith advocates that people start thinking about metadata. “It’s almost cliché in certain circles, but I think everyone will benefit from it. The file system is not a sustainable object store at scale. You can try other things, but people have to start thinking from the point of view of the scientific pipelines.”
iRODS: Wellcome Relief
“Data management has become the real bottleneck for us,” says Guy Coates, a member of the Informatics Systems Group at the Wellcome Trust Sanger Institute in the UK. “The big problem is the downstream [sequence] analysis by the informaticians. We have a huge data inflation problem. If we have 1 TB data, that often multiplies to 10 TB working data on the compute farm.”
Because moving data is hard, Coates says researchers frequently do it once and leave it. “Because no one can find data, everyone decides to keep a copy for themselves. One quite often finds every Ph.D. student will take a copy of the data just to be sure they know where it is!” Another issue is that patient data often comes with different types of consent and rules regarding privacy.
To help decide which data to keep and for how long, the metadata are key, says Coates. (Metadata are descriptors that list the experiment or project, the source of the sample, the lead investigator, and the instrument used.)
For the past 12 months, Coates and colleagues have been using a trial implementation of iRODS to hold BAM files and associated metadata. The Sanger team applies a couple of data management rules: The first governs data replication. “Every file that comes in is replicated immediately to two different storage arrays in two different data centers,” says Coates. This hedges against a data center accident or users accidentally deleting important files.
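The replication rule Coates describes, two copies on two storage arrays in two data centers, could be expressed along the lines of the Python sketch below. It is not Sanger's implementation; the array names, data center names, and the preference for the array with the most free space are all assumptions for illustration.

```python
def choose_replica_targets(arrays, copies=2):
    """Pick `copies` storage arrays, each in a different data center.

    arrays: list of (array_name, data_center, free_terabytes)
    Within each data center, the array with the most free space is preferred.
    """
    best_per_center = {}
    for name, center, free_tb in arrays:
        current = best_per_center.get(center)
        if current is None or free_tb > current[1]:
            best_per_center[center] = (name, free_tb)
    if len(best_per_center) < copies:
        raise RuntimeError("not enough distinct data centers for replication")
    # Rank data centers by the free space of their best array, largest first.
    ranked = sorted(best_per_center.items(), key=lambda item: -item[1][1])
    return [(name, center) for center, (name, _free) in ranked[:copies]]


arrays = [
    ("array-a", "dc-north", 40),
    ("array-b", "dc-north", 75),
    ("array-c", "dc-south", 60),
]
targets = choose_replica_targets(arrays)
print(targets)
```

Refusing to proceed when only one data center is available keeps the rule honest: two copies in the same building would not protect against a data center accident.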
Another set of rules sets data access controls. “The iRODS system can look at the study. If it came from a study where certain data access rules apply to prevent data from going out the front door, it marks the data as private so we don’t accidentally release it.”
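A study-driven access rule of this kind might look like the following Python sketch, applied when a file is ingested. The study identifiers and field names are hypothetical, and the choice to treat files with no recorded study as private is an assumption added here as a cautious default, not something the Sanger team describes.

```python
RESTRICTED_STUDIES = {"study-017"}   # hypothetical study identifiers


def apply_access_rule(obj, restricted=RESTRICTED_STUDIES):
    """Return a copy of the object's metadata with an access label set."""
    labeled = dict(obj)
    study = obj.get("study")
    # Unknown provenance is treated like a restricted study (assumed default).
    if study is None or study in restricted:
        labeled["access"] = "private"
    else:
        labeled["access"] = "public"
    return labeled


def can_release(obj):
    """Only objects explicitly labeled public may leave the site."""
    return obj.get("access") == "public"


incoming = [
    {"name": "p1.bam", "study": "study-017"},
    {"name": "p2.bam", "study": "study-002"},
    {"name": "p3.bam"},
]
labeled = [apply_access_rule(obj) for obj in incoming]
releasable = [obj["name"] for obj in labeled if can_release(obj)]
print(releasable)
```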
Although still a work in progress, Coates says that culturally, “people are starting to take notice of data management. They’re no longer losing files or running out of disk space halfway through an analysis.” Another clever incentive is to charge researchers for the storage space they use: those who apply good data management practices use less storage, which sends a powerful message that filters back through the organization.
This article also appeared in the January-February 2011 issue of Bio-IT World Magazine.