Data Storage and Transport: Headaches and Possible Remedies
By Paul Nicolaus
February 5, 2020 | As organizations generate ever-growing amounts of data, finding ways to overcome the related difficulties of storing, moving, and managing that flood of information remains a central issue.
Nearly every research institution in the healthcare and life sciences space faces the need to distribute large and frequently changing data sets, explained Seth Noble, founder and CEO of Data Expedition, which provides software solutions for network performance and reliability.
The ability to send and receive large volumes of scientific research data quickly and reliably is critical, but large genomic datasets can take hours or even days to transfer. Whether stored locally or in the cloud, he added, transporting data to research colleagues in a timely manner is a major challenge.
Many continue to use hybrid storage approaches where data is hosted partially in cloud environments and partially in private data centers, explained Mark Lambrecht, global director for health and life sciences at SAS, an analytics and data management software company. This is because some datasets are so large that it just isn't feasible, from a cost standpoint, to bring all data to cloud storage even if there are good business reasons for doing so, he said.
In other instances, cloud storage may be restricted. One example is patient data that cannot be moved beyond the walls of a medical facility. This calls for federated approaches where analysis is handled over different locations and only the summary results are centrally stored. “These hybrid and federated approaches require robust transport data standards that can transfer data without losing any semantic meaning to the data and how they are interlinked,” he explained.
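The federated pattern Lambrecht describes can be sketched in a few lines: each site computes summary statistics inside its own walls, and only those aggregates travel to the coordinator. The sites and measurements below are invented for illustration; a real deployment would add secure transport and governance around each step.

```python
# Federated aggregation sketch: patient-level records never leave a site.
# Each site reports only a count, a sum, and a sum of squares; the central
# coordinator combines these into a pooled mean and standard deviation.
from math import sqrt

def local_summary(values):
    """Computed inside the facility; only these aggregates are shared."""
    return {"n": len(values),
            "sum": sum(values),
            "sum_sq": sum(v * v for v in values)}

def pooled_stats(summaries):
    """Run centrally on summaries alone -- no raw patient data required."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, sqrt(variance)

# Hypothetical glucose measurements held at two hospitals
site_a = [5.1, 6.3, 5.8]
site_b = [7.2, 6.9]
mean, std = pooled_stats([local_summary(site_a), local_summary(site_b)])
```

The pooled result matches what a centralized computation over all records would give, which is the point of the approach: the analysis is distributed, only the summary is centralized.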
The wide world of life sciences is getting smaller, according to Steve Levine, as digital technologies close the loop between researchers discovering new treatments and medical practitioners figuring out “precisely which to provide to whom.”
Upstream, research and development scientists may disagree on plenty, but one thing they all agree on is that they are inundated with data, continued Levine, the senior director of life sciences at Dassault Systèmes, a software business headquartered in France with a campus in Providence, RI, home of the 3DS SIMULIA brand. Factors like competition and the quickening pace of research have nudged organizations to explore new ways of extracting every bit of value from the data that is collected.
“Downstream, with the onset of value-based care and the dawn of precision medicine, providing the best patient experience is driving the digital transformation in care centers very rapidly,” he added. Unlike other industries that strive to provide the best products or services at the lowest cost, however, the healthcare industry looks to deliver the best solution at the most reasonable price.
The vagueness of what is considered reasonable has allowed the industry to move forward without placing an emphasis on data standards. This, in turn, has slowed the adoption of digital technologies and allowed “a vast diversity of data types to proliferate.”
While the cloud may offer a lower barrier to entry for providing data access, Levine contends that it is not the key technology that determines utility. In the life sciences, “the key is not so much where the data is stored,” he added, “but rather how the data is stored.”
Data Management Resources
In recent years, there has been an incredible growth of data coming from instruments, said Vas Vasiliadis, who leads the customer team for Globus and teaches in the Computer Science program at the University of Chicago.
Next-generation sequencing has been around for a while, "but we're also seeing some very high-resolution instruments" such as light sheet and cryo-electron microscopes growing substantially over the last year or two, he said. Other examples include high-resolution photon sources like The Advanced Photon Source at Argonne National Laboratory or the UK’s Diamond Light Source.
Many of these instruments deliver a continuous flow of data but have little local storage. It is important to be able to pick that data up and move it somewhere else where it can be stored more permanently and where people can gain easier access to it for downstream analysis.
"The idea is, get the data off there quickly, and very importantly, get it off there reliably," he said, because in many scenarios it isn't possible to reuse a sample or redo an experiment.
Being able to do this at scale and in an automated fashion is essential, too, considering these types of instruments tend to be expensive—and shared—resources. There are people lined up to use them, Vasiliadis added, so the quicker you can get data off of there from the previous run and move on to the next sample, the better.
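The offload step Vasiliadis describes, moving data quickly and verifiably off a shared instrument, can be sketched as a copy-verify-delete loop. The paths, retry count, and checksum choice below are illustrative assumptions, not details of any real pipeline.

```python
# Sketch of an automated instrument-offload step: copy each finished run
# to durable storage, verify the copy with a checksum, and only then free
# the instrument's limited local storage for the next sample.
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 so large runs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def offload(run_file: Path, archive_dir: Path, retries: int = 3) -> Path:
    """Remove a run from the instrument only after the copy is verified."""
    dest = archive_dir / run_file.name
    for _ in range(retries):
        shutil.copy2(run_file, dest)
        if sha256(dest) == sha256(run_file):
            run_file.unlink()  # instrument storage is now free for the next run
            return dest
    raise IOError(f"could not verify copy of {run_file}")
```

The verification matters because, as he notes, many samples cannot be rerun: deleting the source before confirming the copy would turn a transient network error into lost science.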
A hybrid solution for research data management, Globus enables users to move, share, and discover data using a single interface. Whether files are stored on a supercomputer or a laptop, the data can be managed from anywhere using a web browser. The storage remains owned and controlled by users, while the cloud-hosted service manages the interactions.
"The core of the service is a high-speed, reliable file transfer capability," said Vasiliadis. Globus uses an open protocol called GridFTP that provides multiple parallel streams between the two endpoints where the data are being moved, and within each of those streams is a further set of parallel threads used as part of the protocol.
In recent years the service has gone beyond that core focus. The ability to share data with external collaborators has been added, for example, along with additional services for data search and the automation of data flows that are common to large research institutes.
At the German Center for Diabetes Research (DZD), the data and knowledge management team uses graph technology to manage large sets of information and apply it to the prevention and individualized treatment of diabetes.
The DZD has a research network that accumulates large amounts of data distributed across various locations, including labs and hospitals. To address this challenge, DZD is building a master database, based on the graph database management system Neo4j, to provide its team of scientists with a more holistic view of information.
In general, a graph is a mathematical structure with nodes and edges. "The nodes are connected to edges, which is—I think—the most intuitive way of designing a data model for biological data," said Alexander Jarasch, head of data and knowledge management at DZD.
In a browser-based manner, users can take a data point such as a specific gene, patient, or experiment and see what data is connected to it. Jarasch said he started with the Neo4j technology back in 2017 "to build up the meta database from all our different sites and disciplines and also species."
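The node-and-edge model Jarasch describes can be sketched in plain Python; the entities and relationship names below are invented for illustration (DZD's actual schema lives in Neo4j and would be queried with Cypher), but the traversal is the same idea: start at one data point and follow its edges.

```python
# Toy graph in the spirit of the DZD meta database: nodes for genes,
# patients, experiments, and diseases, with labeled edges linking them.
# The specific entities and relationships here are hypothetical.
edges = [
    ("gene:TCF7L2", "MEASURED_IN", "experiment:E1"),
    ("patient:P17", "ENROLLED_IN", "experiment:E1"),
    ("gene:TCF7L2", "ASSOCIATED_WITH", "disease:type2_diabetes"),
]

def connected(node):
    """Everything one hop away from a node -- what a browser view would show."""
    out = [(rel, dst) for src, rel, dst in edges if src == node]
    out += [(rel, src) for src, rel, dst in edges if dst == node]
    return out
```

Starting from the gene node, `connected("gene:TCF7L2")` surfaces both the experiment it was measured in and the disease association, which is the kind of cross-discipline linking the master database is built for.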
Work completed to date involves human clinical studies, which have generated human data and measurements as well as metadata like the number of patients, the number of samples, and what was measured. Other work involves the storage of basic research data on animal studies, which is more of an ongoing process.
Some new projects are extending beyond diabetes to discover potential connections with other illnesses, such as cancer or Alzheimer's disease. There are efforts to connect different databases on a meta-level, he explained, "because we see that there are side effects or long-term complications from diabetes to cancer or from lung diseases to diabetes or back and forth."
Dealing with the Data Lifecycle
Designing, managing, and scaling solutions to deal with large amounts of research data boils down to effectively managing that data through its entire lifecycle, according to Mike Conway, a technical architect working in the realm of scientific research data management at the National Institute of Environmental Health Sciences (NIEHS) and its Office of Data Science.
It's not necessarily a question of scale, he explained. It's trying to address scale by working in upstream systems to improve the capture or recording of proper metadata that can be used as cues for data governance and policies, and then automating those policies based on the properties of the different datasets.
From his vantage point, cloud isn't necessarily a data management solution. "Cloud is more an endpoint that just really kind of compounds your data management problems rather than alleviates them," Conway said.
"Our management problems really start when a sequencing run is done," he said, and there's a need to get that information to a point where it can be passed along to a bioinformatician or analyst. Those types of issues aren't addressed by cloud. It's more about gathering the metadata along the way until the data arrives at a point where it might make sense to push it out to a cloud service for later analysis or sharing.
Without the application of what he defines as data management, "what you end up with are terabytes and petabytes of data piling up on some vendor-supplied storage array," Conway said, where nobody knows what to keep and "you start getting into these governance questions."
If the raw data off the sequencer has been preprocessed, when can we get rid of that raw data? Is this human data versus animal data, and how are we differentially treating it? These are the types of questions that need to be tackled but are not addressed just based on where you're storing the data. "It's more about how we define those policies, how we automate them, how we can also get assurance that those policies have been applied," he added.
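The metadata-driven governance Conway describes, where retention decisions follow from properties of a dataset rather than from someone remembering to clean up, can be sketched as a small rule engine. The metadata fields and rules below are hypothetical stand-ins; iRODS expresses this kind of policy in its own rule language.

```python
# Sketch of policy automation driven by dataset metadata: each dataset's
# properties determine a governance action. Fields and rules are invented
# for illustration.
def retention_action(meta):
    if meta.get("subject") == "human":
        return "retain"        # human data falls under stricter retention policy
    if meta.get("kind") == "raw" and meta.get("preprocessed"):
        return "delete_raw"    # raw reads superseded by preprocessed output
    return "review"            # anything else needs a human decision

datasets = [
    {"id": "run42", "subject": "human", "kind": "raw", "preprocessed": True},
    {"id": "run43", "subject": "mouse", "kind": "raw", "preprocessed": True},
    {"id": "run44", "subject": "mouse", "kind": "raw", "preprocessed": False},
]
actions = {d["id"]: retention_action(d) for d in datasets}
```

Because the decision is computed from recorded metadata, the same run of rules also serves as the assurance step Conway mentions: you can audit, for every dataset, which policy applied and why.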
The Big Blind Spot of Data Management
Conway's background involves work with Reagan Moore, who developed the integrated Rule Oriented Data System (iRODS), open-source data management software, at the San Diego Supercomputer Center and later at the University of North Carolina's Data Intensive Cyber Environments (DICE) Center. This is the nexus from which his data management work with NSF and related DataNet projects stems.
These are examples of where the academic research side of data science is now starting to permeate government as well as biotech and pharma. In the same way that the Internet started in an academic research environment and percolated out, "I think that is also happening with data management, and it's really worth paying attention to the work that has been done on cyberinfrastructure on the NSF side," he added.
When he considers efforts to address data management challenges at NIEHS, Conway explained that the solution is not a homogeneous one. Based on his background, iRODS is being used because the platform automates policy and provides the tools needed to assess whether those policies have been applied. There is also interest in the use of pipelines and workflow languages to move data along the lifecycle.
Conway explained that much of the focus for the past year with the epigenetics core has involved the integration of a Data Commons with BaseSpace Clarity LIMS to gather information about the samples. This, he said, “allows us to apply governance on the data as it’s properly ingested in the Commons with the technical metadata.” Other efforts focus more on organizational policy around data governance and retention.
He likened this process to SimCity, referencing the city-building video game series. Within the game, “you have a flat piece of land, and you have to lay the power lines, you have to lay the plumbing, and then you start seeing buildings sprout up around that,” he explained.
Similarly, cyberinfrastructure efforts can involve a combination of solutions. "We have iRODS, we have standard workflow languages for pipelines, and these sort of landing zones," Conway said. "We're also working with Synaptica on creating a metadata catalog of standard ontologies and vocabulary terms and working to get that permeated into the daily workflows of NIEHS."
"We are definitely looking toward the cloud," he acknowledged, especially through the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. Several cloud pilots are lined up, but all of the data management challenges he's referring to pertain to internal research projects and come before anything touches the cloud.
This is the "big blind spot" of data management from his vantage point. Most talk tends to revolve around data sharing—authentication, authorization, maintaining data sharing agreements, licensing, provenance, and creating permanent identifiers for shared datasets.
All of that is really important, Conway said, but there's a whole range of issues that are prerequisites before you even get to data sharing in the cloud. If there's junk metadata and disorganized data, there is plenty of prior work that needs to be handled before it can be published out as a shared dataset or reference collection.
“That’s sort of the unsexy part about data management,” he added, “but I think that’s actually the more important part.”
Paul Nicolaus is a freelance writer specializing in science, nature, and health. Learn more at www.nicolauswriting.com.