Building the Data Foundation for Faster Time to Science in Healthcare

November 23, 2021

Contributed Commentary by Bill Lynch, Pure Storage

The COVID-19 pandemic pushed advancements in medical science to speeds once considered impossible. Consider this: it took over 20 years to develop a vaccine for polio, yet just 60 years later, the COVID-19 vaccine was developed and distributed to millions in a little over a year.

The speed of vaccine development was not the only advance realized during the pandemic: information sharing painted a more accurate picture of the virus's spread, and therapeutic treatments emerged in record time. Together, these achievements have set expectations for the future of medical science and further fueled an already voracious appetite for faster time to science.

Genomics and artificial intelligence (AI) are key to this quest and to the at-scale advancement of personalized medicine. Both, however, require liquid data and a modern data infrastructure that re-imagines the role of data and how it is used.

It All Comes Down to Data

Legacy infrastructure can’t keep up with the exponential growth of life sciences data. Groundbreaking discoveries depend on quick access to data from patient records, imaging, genomic sequencing, clinical trials and medical claims data. But data growth is only one element of the problem: analyzing all of it together to mine meaningful insights is equally challenging with legacy infrastructures that weren’t built for today’s data volume or compute requirements. 

Legacy approaches to data infrastructure and processing are falling short in three areas:

  • AI and Drug Discovery: The use of artificial intelligence to support drug discovery requires high-performance computing and a data infrastructure to match. Storage and compute demands continually increase as machine learning (ML) algorithms learn from drug candidates, patient records, images, and sequencing. Plus, wider use of natural language processing to extract information from thousands of scientific journal articles requires scalable compute resources. As teams train algorithms and introduce new use cases for machine learning, they require ultra-fast, all-flash performance for access to billions of objects and files, enabling research teams to complete analytics in minutes instead of days or weeks. 
  • 3-D Modeling and High-throughput Drug Screening: The expanded use of 3-D modeling during all phases of drug discovery is increasing storage needs. Three-dimensional modeling supports target identification and validation, as well as assay development to guide high-throughput screening strategies, lead optimization, and candidate selection. The automated testing of thousands of chemical and biological compounds creates significant demands on an enterprise’s infrastructure. An expandable data platform is becoming a prerequisite for unlocking the power of 3-D modeling and high-throughput screening. Requirements for this new era include the ability to load data rapidly, deliver consistent, multi-dimensional performance, and scale on demand. 
  • Genomic Sequencing: Genomic sequencing workflows start with the production of base-pair sequencing reads, and a single high-throughput sequencer can generate billions of reads per run, resulting in terabytes of data per day. Most life sciences organizations employ fleets of sequencing instruments, so it is not uncommon to quickly accumulate petabytes of reads that are then assembled into whole genomes for biological data mining, visualization, and interpretation. To keep pace with this growth and support concurrent sequencing workflows, pharma and biotech research firms must upgrade their IT infrastructures to deliver a quantum leap in speed, scale, and agility. Assembling, mining, and interpreting sequences (known as secondary and tertiary analysis) requires a high-performance compute environment built on a storage architecture that delivers low-latency IOPS and high throughput. In particular, secondary analysis demands high-performance storage to support its metadata-access and concurrency requirements. As the number of sequence reads grows, research organizations need capacity scaling that does not interrupt sequencing runs or scientific analysis. 
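To make the scale described above concrete, here is a minimal back-of-the-envelope sketch of how a fleet of sequencers accumulates petabytes of raw reads. The per-instrument output and fleet size are purely illustrative assumptions, not vendor figures:

```python
# Rough estimate of raw sequencing data growth for a fleet of
# instruments. All figures here are illustrative assumptions.

TB_PER_INSTRUMENT_PER_DAY = 2.0   # assumed raw read output per sequencer
FLEET_SIZE = 25                   # assumed number of instruments
DAYS = 365                        # one year of continuous operation

def fleet_output_tb(tb_per_day: float, instruments: int, days: int) -> float:
    """Total raw read data (TB) produced by the fleet over a period."""
    return tb_per_day * instruments * days

raw_tb = fleet_output_tb(TB_PER_INSTRUMENT_PER_DAY, FLEET_SIZE, DAYS)
print(f"Raw reads per year: {raw_tb:,.0f} TB (~{raw_tb / 1000:.2f} PB)")
```

Even with these modest assumptions, a mid-sized fleet crosses the petabyte threshold within a couple of months, before any derived data from secondary and tertiary analysis is counted.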

Enabling a Modern Data Experience to Support Faster Time to Science

A modern, agile, and scalable data infrastructure provides the foundation life sciences organizations need to realize the full potential of data, with analytics and AI tools layered on top to achieve faster time to science. The ability to add compute and storage on demand is essential, but it is only one of the characteristics of the modern data infrastructure and experience that life sciences organizations require to remove obstacles and accelerate time to discovery and market.

Flexibility in the cloud is essential: Scientific research organizations need flexible options for data strategy. Increasingly, this means hybrid solutions that combine servers and storage on-premises and in the cloud. The applications and processes used in life sciences are constantly evolving, and IT infrastructures must evolve with them to provide cloud-like economics and support containerized workloads. The desire to consume storage via a true utility model is also growing, driven by budget optimization and on-demand scaling. Support for containers, orchestration, and hybrid cloud configurations, whether through managed service providers or co-location providers, likewise gives IT leaders the agility to change workflow design, resource provisioning, and where applications live without being locked into a specific location. 

Speed fuels discovery: The genomic sequencing and analysis workflows that drive biomedical discoveries demand storage that delivers high performance and low latency. Solutions that combine a flash architecture with independent scaling of performance and capacity can deliver petabyte-scale capacity, high throughput, and low-latency access to hundreds of millions of files. For secondary and tertiary analytics processes, including genome alignment and variant calling, this type of architecture can accelerate workflows by three to four times or more.

Simplicity is a force multiplier:  As concurrency increases and the amount of data from DNA and RNA sequencing grows exponentially, it is critical to have an architecture that can be easily scaled. Organizations must eliminate any complex configuration processes that delay access to researchers or result in downtime through human error. Look for solutions that automate internal network configuration and can dynamically manage load balancing for better ongoing performance. 

Protect data first, foremost, and always: As clinical trial data becomes increasingly digital, protecting patient information, research findings, and data backups in the event of an attack is essential. Life sciences organizations require an analytics platform that enables data teams to synthesize intelligence from thousands of devices and quickly identify attempted intrusions. Immutable snapshots are an effective tool for keeping an attacker or rogue insider from deleting backups, even if administrator credentials have been compromised, and help protect data if an attack occurs. And while rapid backups have long been a priority, the focus is broadening to include rapid restore, with the ability to bring back hundreds of terabytes of data an hour in the event of an attack.
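The case for rapid restore can be sketched with simple arithmetic. In this hedged example, both the archive size and the sustained restore rates are hypothetical numbers chosen only to show how recovery time scales:

```python
# Why restore rate matters as much as backup speed: a simple
# recovery-time estimate. All numbers below are illustrative.

def restore_hours(dataset_tb: float, rate_tb_per_hour: float) -> float:
    """Hours needed to restore a dataset at a given sustained rate."""
    return dataset_tb / rate_tb_per_hour

# A hypothetical 1 PB (1,000 TB) research archive:
slow = restore_hours(1000, 10)    # assumed legacy appliance, ~10 TB/hr
fast = restore_hours(1000, 250)   # assumed flash rapid restore, ~250 TB/hr
print(f"Legacy restore: {slow:.0f} h; rapid restore: {fast:.0f} h")
```

At the slower assumed rate, recovery takes days of downtime; at the faster one, the same archive is back within a single shift, which is the difference the broadening focus on restore speed is meant to capture.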

While we can’t predict when the next pandemic or other global health crisis will occur, we can be better prepared. Life sciences organizations are at the ready to make discoveries that will lead to more personalized medicine and advance patient care. They just need a modern infrastructure and data experience to help them achieve faster time to science. 


Bill Lynch is Head of Life Sciences Strategic Alliances, Pure Storage. He can be reached at