Trends in the Trenches 2021: Chris Dagdigian on Security Vaporware, Egnyte, and the Data Management Culture

By Allison Proffitt

September 28, 2021 | Chris Dagdigian began yet another Trends from the Trenches plenary session at Bio-IT World trying to abdicate his position. This year he came with friends—both in person and Twitter recommendations—to further dilute his standing as Bio-IT’s fastest-talking, most candid commentator. I’m not sure it worked.

He shared his bully pulpit with Karl Gutwin, Adam Kraut, and Fernanda Foertter, all current or former BioTeam consultants, who reported on digital transformation, data transformation, and AI. He also plugged the Twitter accounts of @HPC_Guru, Chris Dwan (@fdmts), Glenn Lockwood (@GlennLockwood), and Deepak Singh (@mndoci) as resources with deep domain insight and more or less permission to speak freely.

As always, for his portion of the talk Dagdigian brought an overview of the trends and a warning of the technology pitfalls facing the Bio-IT community. He continued to beat his drum on a few cultural issues Bio-IT still needs to address, and he alerted the community to what’s new and notable.

COVID and Security in 2021

Dagdigian began by outlining some of the changes COVID has wrought on the Bio-IT Community: larger, distributed workforces, API-controlled wet-labs-as-a-service, edge computing to support distributed people, instruments, and processes, and increased security risks that are phasing out “hard shell” security and firewalls.

The security issues are significant, and Dagdigian challenged the audience to make sure that security tooling, observability, and capability fully diffuse throughout the organization.

“When we were talking about device inventory and device discovery, we used to be talking about just laptops. Now we’re talking about, ‘What’s that strange sensor that is attached to a refrigerator? Why is it talking to a weird destination in an unknown country?’ We’re really having to up our game.”

New capabilities will be required, he said, including pervasive network access control, the ability to block unrestricted internet access and force the download of patches, and a view of smart devices and IoT as high-security internal threats and likely sources of intrusion.

He advocated for a ZeroTrust posture—to people, ports, and devices—while also flagging the term as the buzzword of the day. “For those of you who know me, I will call out the BS-laden marketing stuff. ZeroTrust is definitely in the realm of trust but verify. Your vendors are probably lying to you; the marketing people are probably running amok. The proof-of-concepts are probably sponsored or artificial.” Independently test or verify vendor capability claims, Dagdigian said.

But he doesn’t despair of ZeroTrust ever becoming reality. Machine learning and artificial intelligence have emerged as real capabilities, he reassured his audience. But Infosec-for-Life Sciences—anything involving next gen security or endpoint detection and response (EDR)—“is at Peak Marketing.”

Egnyte Explosion

The most surprising 2021 trend in Dagdigian’s consulting caseload, he said, is the number of pharma, biotech, and startup customers who have started using Egnyte or Box for storage and file synchronization: “like 90%,” Dagdigian said.

For the most part, these users are really happy, he conceded. But he still wondered at the long-term utility of an application that has a 100GB maximum file size, MacOS/Windows-based user interfaces that are “files and folders” based and assume predominantly human users, and what seems to be “atrocious Linux support.” The cloud egress fees associated with external data sync traffic are also alarming, he added.
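A team weighing that file-size ceiling could check how much of its existing data would actually hit it. The following is a minimal sketch, not an Egnyte or Box tool: the function name `oversized_files` is hypothetical, and the 100 GB figure is simply the limit reported above, interpreted here as decimal gigabytes (an assumption).

```python
import os
from pathlib import Path
from typing import List, Tuple

# Assumption: "100GB" is read as 100 * 10**9 bytes; adjust if the vendor
# documents the limit in binary units.
SYNC_LIMIT_BYTES = 100 * 10**9


def oversized_files(root: str, limit_bytes: int = SYNC_LIMIT_BYTES) -> List[Tuple[str, int]]:
    """Return (path, size) pairs for regular files under root larger than limit_bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # Skip symlinks so linked data is not counted twice.
                if path.is_symlink():
                    continue
                size = path.stat().st_size
            except OSError:
                # File vanished or is unreadable; ignore it in this sketch.
                continue
            if size > limit_bytes:
                hits.append((str(path), size))
    # Largest files first, since those are the ones most likely to matter.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running it against a instrument output directory would list the files that could not be synced at all under such a limit, which is a quick way to see whether the concern is theoretical or already real for a given lab.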

“This is a major trend for me and ultimately, I don’t want to trash Egnyte. I will say, candidly, that my customers are actually very happy with them,” Dagdigian repeated. “I have no intention of fighting them, blocking them, kicking them out. My job as a consultant is to figure out how to make it work in my world, and my world is peta-scale, highly performant, scientific computing that has to integrate object storage, petabytes, lots of readers and writers. Honestly, I’m a little bit worried. It’s going to be an interesting paradigm. The best design pattern I can come up with right now for an Amazon environment is converting all of my scientific computing over to Windows file sharing protocols and selectively syncing stuff like this into the AWS equivalent of a managed file share.”

Data Policy Revolution

One of Dagdigian’s refrains from previous years—the importance of data governance, data discovery, and data classification—seems to be hitting home. For the first time in 2020-2021, the BioTeam’s consulting workload is measurably changing toward data commons projects, data governance projects, organization-wide assessments of data policies and architectures, and network refresh/redesign efforts, he reported.

“The good news on the policy perspective is I’ve really felt we’ve punted on data governance and data management for many many years, and it feels like the tides are changing,” he said.

That shift in mindset opens some interesting options, Dagdigian said, and there is no one-size-fits-all solution. It should also prompt honest internal assessments of organization culture and communications/collaboration practices.

“There’s a lot of really interesting sort of gnarly procedural, cultural, and org stuff that you’ve got to get through as soon as you get serious about data governance and data management,” he warned.

Science—Old and New—Drives IT

The drivers of storage haven’t materially changed in Bio-IT, Dagdigian observed. Genomics and bioinformatics have historically been the predominant driver of IT development, but as he mentioned early last year, images and image-based data are now the dominant and fastest growing storage type in scientific computing environments. ML and AI require both big and fast storage to continuously access old data for reuse. And chemistry and molecular dynamics are capable of consuming near-infinite amounts of compute capacity.

Computational chemistry and modeling are an argument for premises-based HPC, Dagdigian interjected. “GPUs are getting harder and more expensive to find in the cloud. I have to do region hunting all the time on Amazon to find GPUs for my chemists. If you have a 24/7 compute load that is GPU-heavy and you really care about the economics versus capability, we’re seeing a lot of GPU clusters going into premise systems. Basically, it’s a guaranteed resource and the economics work out as long as you’re keeping that GPU running basically 24/7. If you have bursty workloads, cloud is still better both from a capability, technical, and even cost perspective in most cases.”
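The economics Dagdigian describes come down to a break-even utilization: the fraction of the month a GPU has to be busy before a fixed on-premises cost beats pay-per-hour cloud pricing. The sketch below illustrates that arithmetic only. The function names and every dollar figure are hypothetical, not numbers from the talk, and the model ignores storage, egress, and staffing costs.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month (8760 / 12)


def breakeven_utilization(onprem_monthly_cost: float,
                          cloud_hourly_rate: float,
                          hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Fraction of the month a GPU must run before on-prem is cheaper than cloud.

    onprem_monthly_cost: amortized hardware, power, and hosting per GPU per month.
    cloud_hourly_rate: on-demand price per GPU hour.
    """
    if cloud_hourly_rate <= 0 or hours_per_month <= 0:
        raise ValueError("rate and hours must be positive")
    return onprem_monthly_cost / (cloud_hourly_rate * hours_per_month)


def cheaper_option(utilization: float,
                   onprem_monthly_cost: float,
                   cloud_hourly_rate: float,
                   hours_per_month: float = HOURS_PER_MONTH) -> str:
    """Compare the fixed on-prem cost with the usage-based cloud cost."""
    cloud_cost = utilization * hours_per_month * cloud_hourly_rate
    return "on-prem" if onprem_monthly_cost < cloud_cost else "cloud"


# Hypothetical figures for illustration only.
ONPREM = 1460.0   # dollars per GPU per month, amortized
CLOUD = 4.0       # dollars per GPU hour

threshold = breakeven_utilization(ONPREM, CLOUD)
steady = cheaper_option(0.95, ONPREM, CLOUD)   # near 24/7 load
bursty = cheaper_option(0.10, ONPREM, CLOUD)   # occasional bursts
```

With these made-up inputs the break-even point is 50% utilization, so a GPU kept busy nearly around the clock favors on-premises hardware while a bursty workload favors cloud, which matches the shape of the argument in the quote.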

But not all data sources are from the usual suspects. Dagdigian highlighted wearables, sensors, time-series data, and other streaming data as a new challenge to ingest, store, and exploit. Those types of data can be difficult to store on standard storage systems, he said. “You often need things that kind of look like databases.” These types of streaming data will likely require a mix of new products and technologies, though he highlighted that cloud vendors are offering tools specifically for streaming data—AWS Kinesis, for example.

Cultural Revolution

Dagdigian consistently advocates for changes in culture within the Bio-IT community, and this year’s Trends talk was no different. He again lobbied the community to view data as currency, to treat storage as a consumable, and to dramatically reset the dynamic between research-IT and end-users.

It’s time to view data as currency, not only spending wisely, but also managing wisely, he said. Not understanding the true value of data leads to hoarding, massive inefficiencies, and inability to properly leverage the data at hand. He proposed a litmus test for your organization: If leadership says “all our data is important” that is a much bigger red flag than “we don’t know how to figure out what is important,” he warned.

If data is a currency, scientists must shift their thinking to storage as a consumable.

Dagdigian’s most recurring Trends from the Trenches theme is data storage and management. For years he has predicted that the future of scientific data at rest will be object storage, and he’s highlighted that humans are no longer the dominant consumer of stored data. The human user experience should be handled at the presentation layer; storage architecture can’t be designed for the human user.

For years the challenge has not been acquiring data but managing it.

“It’s really easy to tell your boss you want to go out and download a few petabytes of open access scientific data, but it’s very expensive and operationally burdensome to keep that data available: make it fast, make it performant, make it protected. And so there’s been this friction: it’s easier to acquire data than it is to actually manage data.”

The solution to this tension, he argues, is for scientists to take active roles in managing their scientific data. Drawing on how he’s seen labs manage consumables like reagents and flow cells, Dagdigian proposed that end users plan and justify their storage use in a similar way, making both business and science cases for purchase.

Data As Your Job

That scientist-driven approach balances the old approach to data management, he said. IT-led automated data management has really failed. Policy-driven choices based on file age or last access date just don’t reflect how scientists use and manage data. Scientific data life cycles instead work on a human-controlled “project” or “publication” basis.
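The gap between the two approaches is easy to see in a toy example. The sketch below is illustrative only: the `FileRecord` type, both function names, the 180-day threshold, and the sample paths are all hypothetical, and the project labels stand in for whatever metadata scientists would actually supply.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class FileRecord:
    path: str
    project: str        # label assigned by the scientist who owns the data
    last_access: date


def archive_by_age(files: Iterable[FileRecord], today: date,
                   max_idle_days: int = 180) -> List[str]:
    """IT-style policy: archive anything not accessed within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [f.path for f in files if f.last_access < cutoff]


def archive_by_project(files: Iterable[FileRecord],
                       closed_projects: Set[str]) -> List[str]:
    """Scientist-driven policy: archive files whose project has been closed."""
    return [f.path for f in files if f.project in closed_projects]


today = date(2021, 9, 28)
files = [
    # Old training data for a project that is still active.
    FileRecord("/data/imaging/train_2019.tar", "active-ml", date(2020, 6, 1)),
    # Recently touched scratch output from a project that has finished.
    FileRecord("/data/assay42/scratch.tmp", "assay42", date(2021, 9, 27)),
]

age_based = archive_by_age(files, today)
project_based = archive_by_project(files, closed_projects={"assay42"})
```

Here the age-based rule archives the old training set that an active ML project still needs and keeps the scratch file from a finished project, while the project-based rule does the opposite. Only the second decision reflects what a person close to the science would choose.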

This plea has become even more earnest now that new life sciences workflows (AI, ML, and cryoEM, for example) require storage that is not only big, but fast. We’ve reached a tipping point, Dagdigian said. These new capabilities call for new storage architectures. “I used to be big on tiering and namespaces. Tiering and namespaces and archive get a lot harder when you’re doing ML and AI, because ML and AI always needs to go back to the old data to retrain and reclassify,” he said. These new scientific capabilities have “forced us to build very large, single namespace, highly performant storage systems.”

And as scientific computing has driven the changes in storage architectures, so scientists must drive the changes in data management.

It’s on this point that Dagdigian is his most candid: “Stop whining and deflecting data curation tasks to IT and take some freaking responsibility for the stuff that your career is built off of,” he wrote in his slides. “I’m no longer friendly about that,” he announced to the audience. “It’s been annoying enough that it’s time. If data is your job, managing data is also your job. It’s not the IT person’s responsibility anymore.”