Trends From The Trenches: 6 Themes (And A Few Observations)

By Deborah Borfitz

April 29, 2019 | At his 2019 Trends from the Trenches address at the recent Bio-IT World Congress & Expo in Boston, BioTeam co-founder and senior director of infrastructure Chris Dagdigian shared his pearls of wisdom on the current state of scientific computing, a field where incompetence is now an "existential survival threat" to life science organizations. "We've done 'OK' entering the data-intensive science era," he says. "The hard part is managing what we have."

Dagdigian began by sharing a few general observations, notably that leadership still views scientific computing as a "cost center to be minimized" rather than a core competitive differentiator and HR recruitment and retention tool where insights and value routinely get extracted from data. The user base is also climbing and includes both seasoned scientists forced away from familiar, laptop-scale analytical methods and new hires often showing up with prior high-performance computing (HPC) and cloud expertise. Companies are "pretty bad at training," he adds, especially when it comes to helping intermediate-level users become experts.

The definition of HPC is also "being stretched in extreme ways," Dagdigian says, putting it "in danger of becoming a dumping ground for problems that don't fit on cheap leased laptops." Software and tooling innovation is also outpacing IT's ability to curate and maintain development and execution environments.

"It may be time to start looking at things like Intel commercial compilers," Dagdigian says. A modest investment in compiler and toolchain optimizations could pay significant dividends, given the high cost of graphics processing units (GPUs), NVLink high-speed GPU interconnect and Nvidia DGX-2 platform ($400,000 list price). The performance differential between stock (Relion Cryo-EM) and even upgraded compiler developer tools (GCC-7 on CentOS/RHEL 7) is dramatic relative to Intel ICC compiler (Intel Parallel Studio), he notes.

The most prevalent Relion benchmark, favored by vendors, uses a 50 gigabyte input data set that is small enough to fit in RAM, but that's an unrealistic test for anything other than getting compiler and CPU/GPU optimizations correct, Dagdigian says. Experimental data sets in 2019 are generally no smaller than 60GB. BioTeam is proactively seeking multi-terabyte CryoEM data organized for Relion 2D or 3D classification that it can share, test against, and use to publish its own results.

Monster Data Flows

Artificial intelligence (AI) and machine learning (ML) are both "awesome and ugly," says Dagdigian. "You are going to live or die on the quality of your training data. I think there is going to be extreme pressure to get the cleanest and best data," tempting people to take "ethical shortcuts."

Organizations are starting to employ creative strategies for gaining access to high-quality training data, in one case by providing access to sophisticated analysis tools on the cloud in exchange for data to feed its algorithms, Dagdigian says. The opt-in data sharing process generates 30,000 de-identified, metadata-tagged MRI scans per week that are getting dropped into medical and data workflows.

Little has changed in the networking arena over the last 18 months, Dagdigian says, the most shocking news being that Nvidia purchased Mellanox. "Moving scientific data across networks is still problematic, we're still doing a bad job of rolling out 40-gig and 100-gig networking, enterprise IT is still concentrating on the data center rather than building out edge and labs, we still need to separate science data traffic and business network traffic, and our connections to the outside world are too small. Our firewalls and security controls are also still designed for the way business consumes the internet and not for moving a massive amount of data as a single monster flow."

Cloud has been a capability play for life science organizations for the past decade, but cloud providers might start feeling a little pushback in 2019, says Dagdigian. Serverless computing is transformational, but discovery-oriented science still "relies heavily on interactive human efforts with bespoke tooling." Cloud marketing hype aside, "not everything can be distilled to an API [application programming interface] and built into a service mesh architecture. That is not going to happen in our world."

Two developments are concerning, says Dagdigian. One is cloud providers' efforts to build bespoke accelerated hardware for AI, ML, and inference, which will add complexity to an otherwise simple cost and capability evaluation process. The other is the consistent scarcity of GPU resources on Amazon Web Services. Not being able to get GPUs in a timely or inexpensive manner might drive some colocation and on-premises decisions.

Predictions for 2019

In his latest Trends talk, Dagdigian kept the focus on six themes the industry can expect to see in 2019:

1. The unit cost of storage vs. consumption rate will force hard choices. The unit cost of data storage is decreasing, but not as fast as data is being generated, Dagdigian says. Meanwhile, the cost of running a sophisticated IT computing operation continues to mushroom, compelling some companies to cut their research mission simply to sustain themselves. Governance and HPC resource allocation need to shift from IT groups to scientists and include some sort of data triage process. For open-ended moonshot projects, researchers could perhaps be handed internal credits or grants to "spend" however they see fit, he says. "We're going to have to get serious about this."
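The arithmetic behind that squeeze can be sketched in a few lines of Python. This is only an illustration: the function name, the starting capacity, the growth rate, and the price decline below are hypothetical placeholders, not figures from the talk.

```python
def projected_storage_spend(start_tb, growth_rate, start_cost_per_tb,
                            cost_decline_rate, years):
    """Return one (year, capacity_tb, cost_per_tb, annual_spend) row per year.

    All inputs are hypothetical planning assumptions:
    growth_rate       -- yearly fractional growth in stored data (0.5 = +50%)
    cost_decline_rate -- yearly fractional drop in price per TB (0.2 = -20%)
    """
    rows = []
    capacity = start_tb
    unit_cost = start_cost_per_tb
    for year in range(years + 1):
        rows.append((year, capacity, unit_cost, capacity * unit_cost))
        capacity *= 1 + growth_rate          # data keeps accumulating
        unit_cost *= 1 - cost_decline_rate   # storage keeps getting cheaper
    return rows


if __name__ == "__main__":
    # Assumed scenario: 1 PB today, data +50%/year, price per TB -20%/year.
    for year, tb, cost, spend in projected_storage_spend(1000, 0.50, 100.0, 0.20, 5):
        print(f"year {year}: {tb:8.0f} TB at ${cost:6.2f}/TB -> ${spend:10.0f}")
```

Under these assumed rates, yearly spend is multiplied by 1.5 x 0.8 = 1.2, so the bill grows even though each terabyte gets cheaper. Spend only falls when the price decline outpaces data growth.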

Importantly, companies need to have true operational cost data and be transparent about usage of increasingly contested resources within their scientific computing environment, with "good logging of scientific tools and codes being invoked," Dagdigian says.
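One way such invocation logging might look is sketched below in Python. It is a minimal wrapper built on the standard library, not a description of any tool BioTeam uses; the `run_logged` name, the `project` label, and the log format are all hypothetical.

```python
import logging
import os
import subprocess
import sys
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("tool_usage")  # hypothetical logger name


def run_logged(command, project="unassigned"):
    """Run a command (list of strings) and log who ran what, and for how long.

    Returns a dict describing the invocation so it can also be stored elsewhere.
    """
    start = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    record = {
        "tool": command[0],
        "args": command[1:],
        "user": os.environ.get("USER", "unknown"),
        "project": project,              # hypothetical cost-allocation label
        "seconds": elapsed,
        "exit_code": result.returncode,
    }
    logger.info("tool=%s user=%s project=%s seconds=%.2f exit=%d",
                record["tool"], record["user"], record["project"],
                record["seconds"], record["exit_code"])
    return record


if __name__ == "__main__":
    # Stand-in for a real scientific code: run the Python interpreter itself.
    run_logged([sys.executable, "-c", "print('hello')"], project="demo")
```

Records like these, aggregated per project or per tool, are the kind of usage data that could feed the transparency and cost discussions described above.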

"There is some value to lean times," he quickly adds. "It's the lean times that allow you to blow up silos and to de-crust your infrastructure or org cart and… force organizations to come together around APIs and middleware and common ways of doing things." Organizations will start thinking hard about how to run their technological operations in a different way on a larger scale.

2. Storage selection pendulum has made a seismic shift to big and fast. There have been several fundamental changes on the storage landscape, notably that "the capacity/performance calculus has swung the other way," says Dagdigian. "We now need very fast storage to handle machine learning, AI and image-based workflow requirements." ML training and validation also require ongoing access to "old" data. The dominant file type is no longer genome sequences but image-based (e.g., 3D ultrasounds).

"A tremendous among of deployed storage is nearing end of life or end of contract support," he says, and "really interesting next-gen storage companies have entered the market." Parallel storage systems have also become more attractive to life science companies, who may also be willing to bear more of the administrative and operational burden of getting the performance needed to support a new class of scientists and AI work. The benefits of scaling out network-attached storage are less valuable in context, he says.

"The new requirements for speed plus capacity is deeply scary," says Dagdigian. "We can't trade away performance in exchange for larger capacity anymore. I think there will be lot of [enterprise IT] storage platform switches over the next couple years."

3. End users need to start taking responsibility for their data. The level of knowledge required in the life sciences is becoming too much for a central support organization, says Dagdigian, so domain expertise will be moving to stakeholders in labs and R&D organizations. The problem is that scientists and even leadership typically don't know what data is important and therefore treat it all equally, resulting in significant operational overhead and overspending—and data hoarding.

Dagdigian admits he used to be a fan of single namespace storage, believing it would prevent scientists from doing "wildly inefficient things like storing data in three different locations," and it was the job of the IT department to provide it. "I'm over that," he says. "I think in general we've done a bad job of encouraging users to actively take responsibility for their data."

It remains "wildly inappropriate and impossible" for IT staff to make data classification decisions, or to know the value of a piece of data and if it can be moved outside of a geographic boundary, he continues. Scientific data handling—the actual work of classifying, curating, tagging, moving and life cycle managing data—needs to be pushed back to the researchers who need and use the information. A notable exception would be large-scale physical data ingest, export, and movements. The IT department can also worry about storage that meets business and scientific requirements, and providing users with self-service metrics, monitoring and reporting tools.

4. Compilers, toolchains and silicon matter again. Only a few years ago, it didn’t much matter where you got your hardware and chips (invariably Intel) because they were all relatively the same, Dagdigian says, outside of some GPU divergence for purposes of visualization versus simulation. "Now we have to care about CPUs [central processing units] and GPUs and things called TPUs [tensor processing units]."

Expect to see greater competition between Intel and AMD, which introduced a new line of microprocessors in 2017. The advent of AI and ML has also brought a flood of new processors to the market and the emergence of five hardware vendors trying to differentiate themselves in this space by building custom silicon and software frameworks.

The landscape is also increasingly complicated for GPUs—now needed for AI and ML as well as virtual desktop infrastructure, data visualization, molecular dynamics simulations, chemical informatics and Cryo-EM—and different products and memory configurations are needed for different tasks. Different numbers of GPUs are also needed per chassis.

"It's time to resurrect the benchmark and evaluation crew," Dagdigian says. The real-world cost of GPUs, CPUs and TPUs need to be factored into cost analyses.

5. Colocation facilities are being used more often. Economics still favor on-premises for 24/7 scientific workloads, Dagdigian says, at least when cost concerns aren't superseded by capability or business requirements. Cloud-based computing is "easy and well understood and serverless is… transformative, but persistent month-over-month costs in the cloud combined with petascale egress fees make the economics challenging."
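The trade-off between an upfront build and recurring cloud charges can be made concrete with a rough break-even sketch in Python. Every dollar figure and volume below is a hypothetical placeholder chosen for round numbers, not real pricing or data from the talk.

```python
def cumulative_costs(months, onprem_capex, onprem_monthly,
                     cloud_monthly, egress_tb_per_month, egress_price_per_tb):
    """Return (onprem, cloud) lists of cumulative cost at the end of each month."""
    cloud_total_monthly = cloud_monthly + egress_tb_per_month * egress_price_per_tb
    onprem, cloud = [], []
    for month in range(1, months + 1):
        onprem.append(onprem_capex + onprem_monthly * month)  # upfront + running
        cloud.append(cloud_total_monthly * month)             # recurring + egress
    return onprem, cloud


def break_even_month(onprem, cloud):
    """First month in which cumulative cloud cost reaches on-prem cost, or None."""
    for month, (o, c) in enumerate(zip(onprem, cloud), start=1):
        if c >= o:
            return month
    return None


if __name__ == "__main__":
    # Hypothetical inputs: $1M build, $20k/month to run it, $50k/month in cloud
    # services, and 500 TB/month leaving the cloud at an assumed $40 per TB.
    onprem, cloud = cumulative_costs(36, 1_000_000, 20_000, 50_000, 500, 40.0)
    print("break-even month with egress:", break_even_month(onprem, cloud))
    onprem, cloud = cumulative_costs(36, 1_000_000, 20_000, 50_000, 0, 40.0)
    print("break-even month without egress:", break_even_month(onprem, cloud))
```

With these assumed inputs, heavy egress pulls the break-even point forward by more than a year, which is the shape of the argument for keeping steady 24/7 workloads on-premises or in a colocation suite.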

One "sign of the times" is that data center and telecom developer Markley Group recently named Steve Litster, formerly global lead for scientific computing at Novartis, as its chief technology officer. BioTeam has an active on-prem to colocation project with the Markley Group, Dagdigian notes. The colocation trend has lots of drivers, including the high cost of new builds ("tens of millions of dollars and you usually have to build more than one facility") or upgrades to on-prem facilities, poor cloud economics for some workloads and use cases, consolidation activities, and the ability to aggregate cloud traffic in a colocation suite.

6. Life science HPC stands apart for the sheer size and diversity of its domains and workloads. In other industries, the world of HPC and supercomputing has a modest set of dominant and domain-specific codes and "the application landscape is approachable," says Dagdigian. Not so in the life sciences, which are characterized by "crap code," vast numbers of applications (more than 600 spanning 10 domains), highly specialized subdomains and infrastructure requiring support.

"Individual scientists can now swamp a leadership-class supercomputer with completely valid research questions," Dagdigian says. "That's not sustainable," which is why governance and service scope constraints will become more prevalent.

Dagdigian adds that he favors service-oriented scientific support organized around use cases and end user requirements rather than technological expertise. This "team of teams" approach to service delivery is "a great way to blow away traditional IT silos."