Data Are Forever

By Joe Stanganelli

June 7, 2018 | For attendees at this year's Bio-IT World Conference & Expo, perhaps the biggest focal point was the near-permanence of data, particularly when set against the limited contemporaneous uses for any given dataset in healthcare and life-science IT. Genomics, medical imaging, and other healthcare and life-science data make up some of the biggest of the Big Data in the world. Compounding the storage problem, clinicians routinely have to go back several years (if not decades) to find experiments and other records. And yet, for every decades-old dataset so accessed, many more are recorded and left to gather digital dust "just in case."

"Science still has the data-hoarding problem because so much of a scientist's career progression [depends] on their publication history," observed Chris Dagdigian, BioTeam Co-Founder and Senior Director of Infrastructure, while speaking on this year's BioTeam panel at the conference. (See Bio-IT World's coverage here.)

"Data is forever," Linda Zhou, Director of Research and Life Sciences Solutions for conference exhibitor Western Digital, told Bio-IT World in summation—before following with a limitation: "When you give me a file, I don't assume the drive is going to be there forever… We don't hardcode data."

Plan Ahead for Migration

Consequently, Zhou explained, the best solutions lend themselves well to data migration across lifecycles. In particular, Zhou advocated object storage for the kinds of large, unstructured datasets common to the bio-IT field, because so many parts make up the whole. Meanwhile, other exhibitors and speakers were bullish on containerization, largely because of the ease with which containers (and their data) may be migrated. Containers also give researchers and data scientists enough sandbox room to play without amassing too much technical debt.

"You can't just keep buying storage—because individuals may think it's infrastructure," Stewart Sherpa, an inside sales representative for DataFrameworks (another conference exhibitor), told Bio-IT World. "But it's not."

Indeed, organizations have run into trouble before by over-relying on expensive bespoke storage architectures that become obsolete, or on cloud vendors with prohibitive lock-in terms.

"The sheer amount of data that you are uploading to another provider cannot be moved overnight," Lance Smith, Associate Director of IT at Celgene, warned attendees in a breakout session he presented. "On a day-to-day basis, you're just not going to be moving clouds…and if you get the cloud wrong, you will get a bill that is insane."

For instance, life-sciences IT consultant Chris Dwan recently related to Bio-IT World a woeful anecdote of a client firm that had gone all in on cloud storage—but had so bungled its subscription finances that it wound up paying its entire budget every month just to keep the cloud provider from deleting the company's data.

Information vs. Boxes

Dwan and others told Bio-IT World that this Hobson's-choice scenario came to be because the company's IT department was uninvolved in the purchasing decision—reflecting some of the potential dangers of letting researchers and bioinformaticians take the DIY approach to IT.

"The data-science group should really focus on the information, and not so much [on] the hardware," said panelist Jerald Schindler, Vice President of Biostatistics at Alkermes, at the conference's day-two plenary keynote session on defining data science. "It's about the information. And so what you want are people who are focused on the information—not the box that it's in."

Not everyone at the conference, however, necessarily took the black-and-white view that data scientists must do only data science and IT must do only IT. Sherpa, for his part, explained that the one impacts the other—and that the people who put things into boxes effectively become subject-matter experts on the boxes that they need.

"I think people are asking IT to make these decisions—but they [in IT] are not the ones generating [the data]," said Sherpa of cloud- and other storage-buying processes.

Empowering Data Users

Indeed, Sherpa's remarks reflect a long-running trend in the bio-IT market that vendors have picked up on: scientists demand technological empowerment without having to run to IT for every little thing. Moreover, Dwan offered the above client story as an example of how important it is for central IT to appropriately educate and empower the scientist-users in its organization, a philosophy echoed on this year's BioTeam panel, which Dwan himself moderated.

"I think there's a big revolution coming in terms of self-service," said BioTeam-panel panelist Tanya Cashorali, CEO and Co-Founder of TCB Analytics. "It's going to be like email. Everyone had to start learning how to use email. There [are] a lot of tools coming out for the…non-coders."

"I think, for me, the biggest resource gap is [that] we're not applying enough human power to managing our data [or] curating our data to help whip it into shape," added Dagdigian. "It's cheaper to buy storage to keep storing crap forever."