Data: The Lifeblood of Life Sciences Research
Contributed Commentary By Kieran Maloney
May 19, 2017 | There’s no doubt that data plays a critical role in life sciences research today. I’m referring to digital data that is stored, transmitted, and analyzed on IT hardware. New research technologies are rapidly changing the discovery process, significantly increasing the volume of digital data and reducing cycle times. These advances are straining today’s IT infrastructure. Changes in the lab are happening so fast that the infrastructure must not only meet today’s needs, but also have the flexibility for an uncertain tomorrow.
The Data Volume Challenge
I’m sure you are familiar with the National Human Genome Research Institute (NHGRI) data on how genomic sequencing costs have been dropping at a significantly faster rate than Moore’s Law. This is just one example of how changes in life sciences research are straining IT resources. Not only is the cost to generate data dropping significantly, but the analysis of the data is improving too, due to better analytic tools including increased use of artificial intelligence.
The pace of scientific discovery is accelerating, and the results are yielding tremendous value. The challenge is managing this exponential data growth with a storage infrastructure that complements and enables the discovery process and can scale capacity easily and affordably. In one example of this challenge, the Australian Genome Research Facility (AGRF) was recently struggling with data that had tripled over two and a half years. This dramatic growth rate was forcing them to remove data less than one year old to make room for new data sets. This is the reality many research organizations face today, or will face very soon: having to remove data to make room for new data.
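To put the AGRF figure in perspective, the short Python sketch below converts "tripled over two and a half years" into an annual growth factor and a doubling time. It assumes smooth compound growth, which is a simplification and not something AGRF reported. The function names and the 100 TB starting capacity are hypothetical, chosen only for illustration.

```python
import math


def annual_growth_factor(multiple: float, years: float) -> float:
    """Yearly growth factor implied by data growing `multiple`-fold over `years`.

    Assumes smooth compound growth (an illustrative simplification).
    """
    return multiple ** (1.0 / years)


def doubling_time_years(multiple: float, years: float) -> float:
    """Years needed for the data volume to double at the same compound rate."""
    return years * math.log(2) / math.log(multiple)


def projected_capacity_tb(start_tb: float, factor: float, years: float) -> float:
    """Capacity needed after `years` if growth continues at `factor` per year."""
    return start_tb * factor ** years


# Figures from the article: data tripled over two and a half years.
factor = annual_growth_factor(3.0, 2.5)
doubling = doubling_time_years(3.0, 2.5)

# Hypothetical starting point of 100 TB, projected five years out.
five_year_tb = projected_capacity_tb(100.0, factor, 5.0)

print(f"Annual growth: {100 * (factor - 1):.0f}% per year")
print(f"Doubling time: {doubling:.2f} years")
print(f"100 TB today grows to about {five_year_tb:.0f} TB in five years")
```

Under that assumption, tripling in two and a half years works out to roughly 55% growth per year, which means capacity planned for today is outgrown well within a typical hardware refresh cycle.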
The best approach to addressing this challenge is to follow the data through the discovery process, or workflow. Typically, we see data moving through three primary steps: ingest, analysis, and archive.
This is a simplified view. In reality, there are likely multiple devices ingesting data, with multiple analytic processes occurring in parallel or in sequence, and data that is archived will move back for additional analysis and then return to the archive again.
The storage needs differ for each workflow step, making a tiered storage solution the best fit. For example, it’s important to ingest data quickly to ensure no data is lost during the capture or create event, and to give the research team quick access. As a result, we typically see data streaming directly to SSDs for ingest. The analysis work is often done on high performance computing (HPC) systems, requiring high-speed streaming of data. Here we typically see hybrid storage (a combination of SSD and performance disk), with performance and capacity sized appropriately for the workload. The workload depends on the requirements of each analytic system (HPC and analysis requirements) multiplied by the number of systems being used. And finally, the archive system is a long-term repository requiring large-capacity storage that scales easily at a low cost. Depending on the frequency of archive retrievals, the volume of archived data, and the locations of the research team, organizations will use object storage, cloud, or tape, and sometimes a combination of these, for the archive tier.
With a tiered storage approach, research teams can dramatically reduce their overall storage spend and enable greater performance where it’s needed. Because only a small subset of the overall data is actually active at any given time, it’s important to have the inactive data stored on low-cost, high-capacity storage.
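The cost argument can be made concrete with a little arithmetic. The Python sketch below compares keeping everything on a fast tier against keeping only the active subset there. Every number in it (total capacity, cost per terabyte, the 10% active fraction) is a hypothetical placeholder, not vendor pricing or measured data, and the function name is invented for this example.

```python
def blended_cost_per_tb(active_fraction: float,
                        fast_cost_per_tb: float,
                        archive_cost_per_tb: float) -> float:
    """Average cost per TB when only the active fraction sits on the fast tier."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return (active_fraction * fast_cost_per_tb
            + (1.0 - active_fraction) * archive_cost_per_tb)


# Hypothetical inputs for illustration only.
total_tb = 1000            # total data under management
fast_cost = 500.0          # cost per TB on the performance tier
archive_cost = 50.0        # cost per TB on the archive tier
active_fraction = 0.10     # share of data active at any given time

single_tier_total = total_tb * fast_cost
tiered_total = total_tb * blended_cost_per_tb(
    active_fraction, fast_cost, archive_cost
)
savings_fraction = 1.0 - tiered_total / single_tier_total

print(f"Single tier: {single_tier_total:,.0f}")
print(f"Tiered:      {tiered_total:,.0f}")
print(f"Savings:     {savings_fraction:.0%}")
```

The smaller the active fraction and the wider the price gap between tiers, the larger the savings; substituting an organization’s own figures is the only way to know where it actually lands.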
With the volume of data being so large, it’s important to have a data management process that is easy to deploy and manage at scale. A good data management plan includes automated data movement across storage tiers, integrated with automated data protection to preserve the data. For example, policies can create duplicate copies shortly after ingest, with one copy in active storage and one in archive storage. Then, as the copy is removed from active storage, a policy may create a second archive copy, depending on the durability of the archive storage tier.
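The copy policy described above can be sketched in a few lines of Python. This is a toy model, not the interface of any real data management product: the `FileRecord` class, the tier names, and the `archive_is_highly_durable` flag are all hypothetical stand-ins for whatever a given policy engine exposes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileRecord:
    """Tracks which storage tiers currently hold a copy of a file (hypothetical)."""
    name: str
    copies: List[str] = field(default_factory=list)


def on_ingest(record: FileRecord) -> None:
    """Shortly after ingest: one copy in active storage, one in the archive."""
    record.copies = ["active", "archive"]


def on_active_removal(record: FileRecord, archive_is_highly_durable: bool) -> None:
    """When the file ages out of active storage, protect it in the archive.

    If the archive tier is not considered durable enough on its own, a second
    archive copy is created before the active copy is dropped.
    """
    if not archive_is_highly_durable:
        record.copies.append("archive-2")
    record.copies.remove("active")


# Example: a file archived to a tier that needs a second copy for protection.
run = FileRecord("sequencing_run_001.bam")  # hypothetical file name
on_ingest(run)
on_active_removal(run, archive_is_highly_durable=False)
print(run.copies)
```

In this sketch the second archive copy is written before the active copy is removed, so the file is never left with a single copy. That ordering is a design choice made here for safety, not something the policy description requires.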
The best storage and data management plan is meaningless if scientists can’t access the data. They need access when they want it regardless of IT details like operating system (Windows, Mac, Linux, UNIX), protocol (LAN, SAN, IP), or storage location (primary, archive, cloud). Not only do they need access on demand, but today’s research involves collaboration, so they need shared access. And to enable the efficiencies of a truly collaborative, parallel workflow, researchers need simultaneous access to the same file. Providing this level of access requires a single file system for all data under management, with multi-platform and multi-protocol support and a persistent file directory structure, meaning researchers and applications can find and access files in the same place regardless of storage location.
The changes in life sciences research are happening so fast that it’s critical for IT storage to keep pace and prepare for tomorrow. At the current growth rate, monolithic storage silos will eventually break down under hardware cost, management time, or both. What is needed is a tiered storage approach, with independently scaling performance and archive storage, combined with policy-based data management and shared data access for collaborative workflows. As new alliances form and new technologies are introduced, the storage infrastructure must be flexible enough to accommodate these changes so that it enables the dynamic research process and doesn’t become a bottleneck to scientific progress.
Kieran Maloney is a Solution Marketing Manager with Quantum Corporation. With 20 years of storage industry experience, Kieran is focused on solutions to better manage the ever-increasing pace of data growth inside the data center. He can be reached at Kieran.email@example.com.