Evolving from ‘My Data’ to ‘Our Data’: How a Global Data Environment Can Advance Population Health Research
Contributed Commentary by Molly Presley, Hammerspace
October 21, 2022 | The life sciences community has long faced challenges around data management, data sharing, and the IT infrastructure needed to support scientific research. More powerful and higher-resolution instruments continue to generate data at an unprecedented rate, and the computational resources required for discovery continue to multiply. Increasingly complex and decentralized workflows stretch both resources and budgets in multiple areas.
These challenges are compounded in a world shaped by COVID-19, where global health is at the forefront of scientific research, and data drives understanding and solutions.
Harmonizing Data and the Impact on the Health of the Human Population
Over the past decade, laboratory technologies used by life sciences and biomedical researchers to analyze disease conditions have been deployed and adopted faster than ever. Consider Moore’s Law as a point of comparison: while computing technology has advanced exponentially, innovation in the laboratory has arguably moved even faster. New technology now arrives in weeks and months, where it previously took years.
Data technologies and formats are also becoming more diverse in scientific research due to the increased use of computationally intensive tasks like simulation, genomic sequencing, and viral analysis. For example:
- COVID research utilizes technologies like cryo-electron microscopy, which images samples at cryogenic temperatures to resolve molecular structures
- Supercomputing uses metrics known as hero numbers that serve as benchmarks for a particular data flow
- Researchers use a process called extract, transform, and load (ETL) to convert data into a format they can analyze and gain knowledge from
In all, there are about 40 standard data formats, which quickly becomes challenging when you’re trying to look at data in a larger context beyond just a specific research project. For example, multiple sequencer manufacturers use many file and sample preparation formats.
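As a rough illustration of the extract, transform, and load pattern mentioned above, the following minimal Python sketch harmonizes two differently formatted instrument exports into one table. The sample data, the field names (sample_id, read_count, id, reads), and the table layout are all invented for this example; real sequencer formats are considerably more involved.

```python
import csv
import io
import json
import sqlite3

# Hypothetical exports from two instruments; the field names are invented.
CSV_EXPORT = "sample_id,read_count\nS1,1200\nS2,950\n"
JSON_EXPORT = '[{"id": "S3", "reads": 700}]'


def extract():
    """Extract: parse each source in its native format."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
    json_rows = json.loads(JSON_EXPORT)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both layouts onto one (sample_id, read_count) schema."""
    records = [(r["sample_id"], int(r["read_count"])) for r in csv_rows]
    records += [(r["id"], int(r["reads"])) for r in json_rows]
    return records


def load(records):
    """Load: write the harmonized records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (sample_id TEXT PRIMARY KEY, read_count INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?)", records)
    conn.commit()
    return conn


def run_etl():
    """Run the three stages in order and return the loaded database."""
    return load(transform(*extract()))
```

Once the records share one schema, a single query can span data that originally arrived in different formats, which is the point of harmonizing before analysis.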
COVID Research Is a Case in Point
A fundamental challenge for COVID analytic data, especially early on when the virus was largely unknown, was the wide variety of collection formats. To begin an analysis, data scientists and curators had to figure out how to bring those data points into context. To serve a globally dispersed and diverse research community, the metadata attached to each data set had to describe where the data came from, the sequencer used, the sample type, the preparation steps, and so on. Then, once analyzed, the data was tagged with information such as the family it belonged to and other defining characteristics.
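To make the idea concrete, here is a minimal Python sketch of what such a metadata record might look like. The class name, field names, and example values are hypothetical and chosen only to mirror the items listed above; they do not represent any particular community standard.

```python
from dataclasses import dataclass, field


@dataclass
class SampleMetadata:
    """Provenance attached to one data set (all field names are illustrative)."""

    origin: str                   # where the data came from, e.g. a lab or site
    sequencer: str                # instrument that produced the data
    sample_type: str
    preparation_steps: list[str]
    tags: dict[str, str] = field(default_factory=dict)  # added after analysis

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "origin": self.origin,
            "sequencer": self.sequencer,
            "sample_type": self.sample_type,
            "preparation_steps": self.preparation_steps,
        }
        return [name for name, value in required.items() if not value]


# A record submitted without preparation steps (values are made up).
record = SampleMetadata(
    origin="hypothetical-lab-01",
    sequencer="hypothetical-sequencer-x",
    sample_type="nasal swab",
    preparation_steps=[],
)
```

A curator could call record.missing_fields() before accepting a submission, then add entries to record.tags once the analysis is complete.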
Researchers around the world worked together to create a global analytics collaborative as quickly as possible. By collaborating with cloud providers and using local and collaborative supercomputing groups, the research and life sciences communities were able to generate the entire molecular structure of the SARS-CoV-2 virus in a matter of weeks. This was made possible by infrastructure that supported high-quality, high-speed, highly collaborative analysis and globally shared results.
Likewise, research into the effectiveness of masks used computational fluid dynamics to simulate how a mask changes the effect of a cough on the surrounding environment.
Designing Infrastructure With Data Sharing and Collaboration in Mind
A global data environment also supports global accessibility and collaboration between geographical regions. For example, genomic sequencers generate massive amounts of data, making them prime candidates for high-performance computing (HPC) environments.
However, life science researchers operating in extreme and remote environments like sub-Saharan Africa are operating on the edge. Not only do they not have access to HPC compute clusters, but they may also have little to no electric power and internet connectivity.
This is where equity design comes in. Equity designers develop hybrid computing solutions that enable large-scale data sharing and analysis between energy-poor regions and advanced labs. In the life sciences, research and development are paving the way for more precise health measures and personalized medicine in remote communities worldwide.
Best Practices: Switching the Perspective on Data from Mine to Ours
This type of global collaboration underscores the need for a cultural shift to view data no longer as “my data” but “our data.”
To enable data accessibility and availability, a global data environment is needed that eliminates storage silos by unifying them under a single global namespace and encourages collaboration across a distributed workforce with distributed data storage. Significant steps forward include:
- Creating actionable, metadata-based policies to enable complete control and lifecycle management of an organization’s data. Define and enforce policies for your data to meet cost, performance, locality, security, and protection objectives.
- Making all data accessible and available from any location while simultaneously ensuring it is stored and protected on the most cost-effective storage tier.
- Enabling simple and automated multi-cloud agility and elasticity. This approach allows researchers to instantly access compute resources and storage across any cloud or region. IT teams can also meet unforeseen peak demands and access new technologies.
These steps would meaningfully advance the ability of the life sciences community to manage and share the vast volumes of data needed to support scientific research and discoveries.
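The first of these steps, metadata-based policy, can be illustrated with a small Python sketch. This is a hypothetical example under stated assumptions, not any vendor's actual policy engine: the FileInfo fields, the rule conditions, and the tier names (nvme, archive, object) are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FileInfo:
    """Metadata a policy can inspect (fields are illustrative)."""

    path: str
    days_since_access: int
    project: str


@dataclass
class Policy:
    """A named rule that sends matching files to a storage tier."""

    name: str
    matches: Callable[[FileInfo], bool]
    tier: str


# Rules are checked in order and the first match wins. Tier names are made up.
POLICIES = [
    Policy(
        "active-project",
        lambda f: f.project == "covid" and f.days_since_access <= 30,
        "nvme",
    ),
    Policy("cold-data", lambda f: f.days_since_access > 180, "archive"),
]
DEFAULT_TIER = "object"


def choose_tier(info: FileInfo) -> str:
    """Return the tier of the first matching policy, or the default tier."""
    for policy in POLICIES:
        if policy.matches(info):
            return policy.tier
    return DEFAULT_TIER
```

Because placement is decided from metadata rather than from where a file happens to sit, the same rules can apply regardless of which site or cloud holds the data.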
Molly Presley brings over 15 years of product and growth marketing leadership experience to the Hammerspace team. She has led the marketing organization and strategy at fast-growth innovators such as Pantheon Platform, Qumulo, Quantum Corporation, DataDirect Networks (DDN), and Spectra Logic. She was also responsible for the go-to-market strategy for SaaS, hybrid cloud, and data center solutions across various data-intensive verticals and use cases. At Hammerspace, Presley will lead the marketing organization and inspire data creators and users to take full advantage of a truly global data environment. She can be reached at email@example.com.