Trends From The Socially Distanced Trenches: Storage, Image Analysis, And What’s Coming For Bio-IT
By Allison Proffitt
May 13, 2020 | The 10th anniversary of Chris Dagdigian’s Trends from the Trenches presentation kicked off yesterday via WebEx with a Nintendo-generated image of Chris wearing a mask, standing in a server room. (His signature black t-shirt shows up by slide #3.) It’s a far cry from the standing-room-only auditorium at the Bio-IT World Conference & Expo in Boston where he has historically given this talk.
But changing times call for creativity and flexibility, and Dagdigian, BioTeam co-founder and senior director of infrastructure, brought the same candor, speed, and insight to his Trends talk yesterday, again addressing hundreds of audience members, but this time virtually.
Image-based data acquisition and analysis is the fastest-growing driver of IT requirements based on BioTeam’s current projects. That covers more than just CryoEM, Dagdigian said: it also includes confocal and lattice light-sheet microscopy, 3D microscopy, MRI, and fMRI image analysis. But genomics, machine learning and AI, and chemistry and molecular dynamics are also setting the stage for Bio-IT trends in 2020, Dagdigian said.
Genomics demands storage capacity, non-GPU computing, large-memory computing, and data ingest and movement. CryoEM and other image-based data acquisition and analysis workflows, however, need both storage capacity and storage performance, Dagdigian said. “In the past, I could get away with a slow but big file system for sequence-based informatics. When I’m dealing with images, and also ML and AI, we actually are starting to see changing requirements where we actually need faster storage instead of bigger storage or both big and fast storage.”
Machine learning and AI, in particular, require constant access to old data. “Particularly when you’re doing novel and model training, the issue is you’re constantly rerunning or reanalyzing your models, which means you need access to all of your old historical data.” This requires more performant storage than we’ve used in the past, Dagdigian explained.
Molecular dynamics (MD) also demands more cloud workload migration and scratch storage. “Chemistry and MD are interesting workloads to shift to the cloud, simply because they tend to be CPU-heavy and data-light. From a ‘lift and shift’ aspect, it’s relatively straightforward to move some chemistry workloads to the cloud where maybe data volumes were preventing you from doing a lot of your sequence-based informatics in a cloud environment.”
The biggest source of hype among all of these areas? “ML and AI, of course!” Dagdigian said. Unlike other overhyped tech (Dagdigian called out blockchain specifically), AI and machine learning are useful and relevant, he emphasized, and are driving transformative changes in life sciences. But that doesn’t mean that everything you’re pitched is real. “You have to look out for aggressive sales techniques.” The terms AI and machine learning are slapped on everything from firewalls to generic software products.
If you are early in your adoption cycle, Dagdigian recommends caution and a bit of cynicism. Don’t adopt solutions you can’t fully exploit. The innovation cycle is moving so quickly that it makes sense to buy only what you can take full advantage of. Dagdigian recommends a “slow walk.” “The longer you delay your jump into the AI/ML space, the more of a chance you have to take advantage of the latest innovations.”
Next Stage Storage Problems
The common denominator in the sciences driving IT is the changing requirements of storage, and storage issues still make up a “giant percentage” of BioTeam’s consultancy business, Dagdigian said. “We’ve solved the capacity problem, but we’ve still got really gnarly data management, data curation, data governance, and data movement issues.”
Our data are big and only getting bigger. In the past, Dagdigian argued that an organization needs a human data curator and data governance once it reaches about 1 petabyte, yet BioTeam still routinely sees organizations with no such standards or curation at more than 10 petabytes of data.
Dagdigian didn’t mince words: “People are just mindlessly expanding their storage systems rather than making the hard choice to take hold of their data, triage their data, organize their data, wrangle this data. That was ok if you didn’t have effective scientific leadership and you were ok just throwing money at a problem and you wanted to avoid having difficult conversations with scientists about moving or deleting data. That was fine—until now.”
Organizations and institutions are doing themselves and their R&D pipelines real harm with this sloppy approach to data management, Dagdigian warned. “Data awareness is a competitive differentiator,” he said. “Organizations can succeed and fail on this capability alone.”
He argues for a culture change. Storage must be considered a lab or group consumable and be tracked, budgeted for, managed, and used the same way you use other expensive consumable laboratory resources.
Storage allocations must also belong to groups or projects, not individuals. “We’ve sort of proven that we can’t sensibly behave in that world,” he said. As a best practice, he recommended the file system quotas used by the National Energy Research Scientific Computing Center (NERSC). Each user can have 40 GB of personal data space, but anything beyond that must be stored under a project allocation.
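To make the quota idea concrete, here is a minimal Python sketch of how such a policy might be checked. It is an illustration only, not NERSC's or BioTeam's tooling: the `Usage` record, the `over_quota` function, and the example owner names and project limits are all hypothetical, and only the 40 GB personal figure comes from the talk.

```python
from dataclasses import dataclass

GB = 10**9
PERSONAL_QUOTA_BYTES = 40 * GB  # the 40 GB per-user figure cited in the talk


@dataclass
class Usage:
    owner: str       # hypothetical user or project name
    kind: str        # "user" or "project"
    bytes_used: int  # current consumption in bytes


def over_quota(usages, project_quotas):
    """Return the owners whose usage exceeds their allocation.

    Users share one fixed personal cap; projects are looked up in
    `project_quotas` (an assumed mapping of project name to bytes).
    A project with no entry is treated as having no allocation.
    """
    flagged = []
    for u in usages:
        if u.kind == "user":
            limit = PERSONAL_QUOTA_BYTES
        else:
            limit = project_quotas.get(u.owner, 0)
        if u.bytes_used > limit:
            flagged.append(u.owner)
    return flagged
```

The design point is that the personal cap is a small constant, while anything large has to be attached to a named project with its own budgeted limit.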
Siloes At The Edge
Beyond being big and expanding rapidly, our data remain siloed across a growing number of data sources and endpoints. “If your network is not up to snuff, if you haven’t started thinking about science DMZs and separating data traffic from your business traffic on your network, we’re starting to see situations where data can be stranded at the edge.”
The good news: compute at the edge is truly feasible now and can be used to manage data ingest. A simple example: do a brief quality control test on the data coming off an instrument before it’s ingested for analysis and storage. Data that don’t pass muster can be discarded immediately.
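A rough Python sketch of that triage step might look like the following. The quality check here (a minimum file size to catch truncated output) is a placeholder assumption, as are the function names and directory layout; a real check would be specific to the instrument and file format.

```python
import shutil
from pathlib import Path

MIN_BYTES = 1024  # hypothetical threshold; real QC would be instrument-specific


def passes_qc(path: Path, min_bytes: int = MIN_BYTES) -> bool:
    """Placeholder QC: treat files smaller than `min_bytes` as truncated."""
    return path.stat().st_size >= min_bytes


def triage(incoming: Path, accepted: Path):
    """Move files that pass QC into `accepted`; delete the rest at the edge.

    Returns two lists of file names: (kept, discarded).
    """
    accepted.mkdir(parents=True, exist_ok=True)
    kept, discarded = [], []
    for path in sorted(incoming.iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories in this simple sketch
        if passes_qc(path):
            shutil.move(str(path), str(accepted / path.name))
            kept.append(path.name)
        else:
            path.unlink()  # failed QC: never ingested or stored
            discarded.append(path.name)
    return kept, discarded
```

Only the files returned in `kept` would go on to the network transfer and central storage, which is the saving the edge check is meant to deliver.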
Of the common architectures, Dagdigian uses data lakes, calling them effective while noting that they have seen many failures. Data commons are increasingly attractive; BioTeam likes the Gen3 architecture. (Read more: Building A Commons: How Bristol-Myers Squibb And BioTeam Used Gen3 To Build A New Data Paradigm.) He also advocates for science DMZs.
Dirty Data, Dirty Outcomes
Dagdigian reserved his strongest criticism for the dangers of dirty data—data collected with no SOPs, governance, metadata, curation or standards.
“IT can’t do this for you. I solidly place the blame for this issue on scientists, and, honestly, scientific leadership who are—oddly enough—fearful of having difficult conversations with their scientists or unwilling to budget or unwilling to resource capabilities that they’re absolutely going to need,” he said.
Until now, this has been annoying, but it will have very serious impacts on our ability to use machine learning and AI without bias in the future. “This is really, really going to bite us, particularly as we move into the ML and AI world where cleanliness and cataloging and curation and tagging of our data is really essential if you want to get accurate results to come out of your ML and AI workloads,” he warned.
With our track record of messy data, bias will be an increasing area of concern as the life sciences employ more machine learning and AI.
“It is our responsibility in the bio-IT space—it’s our responsibility as scientists—to understand the risks of biased data: unconscious, explicit,” Dagdigian said. We are responsible for gathering clean data from diverse and equitable sources on which to apply machine methods. Without clean, unbiased data we will not generate useful results. Bad inputs generate bad models which produce bad outputs that lead to bad outcomes, he warned.
Editor's Note: If you missed the live session, you can still view the recording. It will be available on demand here: https://www.bio-itworldexpo.com/trends-in-the-trenches-webinar by the end of the week.