Researcher-Driven Data Stewardship, Storage: Trends from the Trenches 2020
By Allison Proffitt
October 14, 2020 | On its 10th anniversary, the annual Trends from the Trenches presentation from Chris Dagdigian, co-founder and principal consultant of BioTeam, and friends had lost none of its characteristic candor and punch—even in a virtual environment and even as Dagdigian successfully continued to limit his own speaking time.
In 2020, the scientific drivers for research IT continue to be genomics and bioinformatics, image-based data, machine learning and artificial intelligence, and chemistry and molecular dynamics. Dagdigian flagged image-based data, especially CryoEM datasets, as the fastest-growing IT driver BioTeam has ever seen. However, machine learning and AI took the prize for the hardest problem.
Usually, Dagdigian said, IT purchasers can choose between fast storage or big storage. A storage environment will be broken into wedges of high-performance storage along with slower archive tiers. But the constant iteration and retraining required for machine learning—against both old and new data—renders that tradeoff null. It’s a radical shift, Dagdigian said. “Not only do they need very, very fast storage because they’re actively analyzing new stuff, but your storage namespace and your storage platform needs to be very, very large as well as they’re often reaching back in time, and they’re constantly rerunning old data.” The change will have an impact on the historically dominant scale-out NAS vendors, he predicted.
While Dagdigian again flagged ML and AI as one of the most overhyped technologies of 2020, he blamed sales teams, not the technology itself. “Unlike blockchain,” he quipped, machine learning and AI are real, beneficial technologies driving transformation. But the terms are on marketing overdrive—with “tech bros” saying “creepy” stuff to sell the idea to the C-suite or the Board, he observed. Instead, adoption of these technologies should be driven by the scientists, engineers, and other end users who deploy them.
In 2020, scientific data are big, siloed, biased, and dirty, Dagdigian said. We have failed at data management, he said, recalling that he used to recommend at least one employee dedicated to storage management once an organization reached a petabyte of data. Now, he said, he’s seen multiple petabytes of storage with no manager, no curation, no governance: “rudderless leadership!”
While Dagdigian used to task scientists and researchers with data triage—boldly challenging them to stop whining and accept that this is part of their job—he now advocates for an active scientist-IT partnership. Scientists should still stop whining and accept responsibility for their storage use, he said, but, “new and sterner methods are required.”
He advocates for an experiment or project-assigned storage model with little to no personal space for researchers to fill. When assigned to a project, storage will be justified, budgeted, and defended like all other lab consumables. Researchers will need to shift their thinking, and IT staff will need to become more comfortable with hard conversations about priority use and data hygiene.
Instead of policy-driven auto-tiering—which he advocated for last year—Dagdigian shifted to calls for self-service storage for scientists. Scientists think in terms of projects, experiments, and groups, he said, and should be able to archive data in the same way—but without requiring an IT ticket to do so.
There are more data silos and sources than ever, he said. Data-rich environments are now diffuse and at the edge. Data lakes are sometimes effective, though there have been many failures. Data commons are increasingly interesting, he said; BioTeam has been working with Gen3 Commons recently and is impressed.
Last year, Dagdigian expected BioTeam to have interesting work with compilers and CPU benchmarking; neither took off for the business, though he did suggest keeping an eye on Intel, AMD, and ARM64.
Finally, Dagdigian didn’t totally let scientists off the hook when it comes to data management. “I’m tired of scientists who build careers on ‘data intensive science’ whining about having to actively manage data that drives their career,” he repeated.
While he doesn’t claim to have gotten everything right, Dagdigian’s core message of the past few years—born of years of experience in the trenches—has remained consistent. Especially in the age of ML and AI, we need clean data from diverse and equitable sources to have any hope of applying machine methods broadly and across our many disciplines. And that must start with the researchers themselves.

Editor’s Note: Even if you missed the start of the event, Bio-IT World Conference & Expo virtual is still live. Register now for on-demand presentations.