AI Ushers in HPC Revival, Says TACC’s Dan Stanzione

April 23, 2024

By Allison Proffitt 

High performance computing is enjoying an extreme vindication, said Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin and Executive Director of TACC. In the opening plenary presentation at last week’s Bio-IT World Conference & Expo, Stanzione said that after years of cloud computing hype, artificial intelligence now demands traditional high performance computing architecture.

That’s a space Stanzione knows well. The Texas Advanced Computing Center—TACC—is home to some of the largest supercomputers on the planet and Stanzione has been there for more than 14 years. Though located at UT Austin, TACC is mostly supported by National Science Foundation funding, explained Stanzione. “‘Largely government funded’ means if you want time with our systems, I mostly have to give it away!”  

TACC’s stable of systems includes Frontera and Stampede2, which both rank in the Top500’s top 50. Jetstream and Chameleon are also NSF systems. Two other systems—Lonestar6 and Vista—serve Texas academic and industry users. Altogether the compute power is significant: Stanzione reported about 1,000 GPUs, one million CPU cores, and seven billion core hours. “We have a couple hundred petabytes of data, [and are] well on our way to exabytes fairly soon.”

Lest those numbers suggest a purely IT mindset, Stanzione emphasized that TACC lives in the research part of the university. “From my perspective, we don’t do IT. We build large scale research apparatus, and we happen to use components that come from IT to do that.”  

He believes that modern computational science comprises three classes of work: simulation (computationally querying mathematical models), analytics (computationally analyzing experimental results), and machine learning/AI (computationally querying datasets, whether they originated from experiments or simulations). Science and engineering demand all three, he said, and the TACC systems exist to support that.

But interestingly, Stanzione says that the life sciences have only fairly recently needed to take advantage of TACC’s computational strength. Before 2005, he said, any life sciences jobs were really chemistry and molecular dynamics jobs, “using the name ‘life sciences’ to be more fundable.” Then TACC focused on things like modeling tornadoes and rocket engines, colliding black holes, and hypersonic design. Now, as modeling has advanced in biology, the workloads have expanded to include things like HIV capsid modeling, COVID modeling, and coronary artery flow modeling. In 2022, Jose Rizo-Reyes used Frontera to model vesicle fusion between neurons—the suspected precursor of thought.

Life Sciences Punch List 

Compared to the physics, climate science, and engineering jobs of the past, life sciences work has made different demands of TACC, Stanzione said. Life sciences code changes frequently, pushing the TACC team to adopt containerization. Life sciences researchers are engaged (read: impatient), so responses became more interactive and real-time rather than batch work. Life sciences jobs are not generally mechanistic, so workflows became data-intensive. And because samples are sensitive and unstable (read: gooey), TACC adopted web services and automated workflows.

These changes have enabled new life sciences projects. For example: the Acute to Chronic Pain Signatures Program (A2CPS) is working to identify biosignatures that predict which patients will transition from acute to chronic pain. The Data Integration and Resource Center is compiling all the A2CPS components and serving as a community-wide nexus for protocols, data, assay and data standards, and other resources.

AI’s Mind Shift 

Now, Stanzione said, life sciences research projects (among others) are calling for AI.  

Artificial intelligence in the form of neural nets has been around for decades, of course. But Stanzione said recent high performance computing advances—silicon processes, GPUs—have made scaling much easier. And algorithm advances—specifically the transformer architecture Google researchers published in 2017 in “Attention Is All You Need”—made very large neural nets tractable by allowing them to be split across multiple compute servers.
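The core operation in that paper is scaled dot-product attention: every token’s output is a softmax-weighted mix of all tokens’ values, built almost entirely from large dense matrix multiplies—exactly the workload that maps well onto GPUs and can be partitioned across servers. A minimal NumPy sketch, illustrative only and not any production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention ("Attention Is All You Need", 2017).

    Q, K, V: (seq_len, d) arrays of queries, keys, and values.
    Returns a (seq_len, d) array: each row is a weighted mix of V's rows.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq, seq) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, rng.standard_normal((4, 8)))
# Each output row is a convex combination of V's rows, so attention
# over a constant V returns that constant:
const = scaled_dot_product_attention(Q, K, np.ones((4, 8)))
```

The heavy lifting is the two matrix products, which is why the workload rewards the dense-compute, high-bandwidth hardware Stanzione describes.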

Building large language models requires high performance I/O, high performance computing, and high performance communication, he pointed out—all foundations of high performance computing. AI requires HPC clusters—not the cloud or enterprise computing—he argued.

But while Stanzione believes wholeheartedly that HPC built AI as it is today, he also acknowledged that “the genie is out of the bottle” and now AI is the “gravity well” that will drag high performance computing toward its particular needs and preferences.

“We’re going to have to live with the hardware built for AI, which means putting mixed precision methods in modern scientific computing. Data science is also a gravity well, which means we should give up and write everything in Python. So I’m sorry to the C and Fortran aficionados out there, but that’s just the way it’s going.”  
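Mixed precision here means storing and multiplying data in the low-precision formats AI hardware favors while accumulating in higher precision. A toy NumPy illustration of why the accumulator matters (an assumed example, not TACC code):

```python
import numpy as np

# Sum 20,000 copies of 0.1 stored in float16, the kind of low-precision
# format AI accelerators favor. The true sum is about 1999.5.
x = np.full(20_000, 0.1, dtype=np.float16)

# Naive: accumulate in float16 too. Once the running total reaches 256,
# float16's spacing there (0.25) makes each added 0.1 round away to
# nothing, so the sum stalls at 256.
naive = np.float16(0.0)
for v in x:
    naive = np.float16(naive + v)

# Mixed precision: keep float16 storage, accumulate in float32.
# The result lands close to the true value.
mixed = float(x.astype(np.float32).sum())
```

The same effect, scaled up to billions of parameters, is why scientific codes adopting AI hardware have to be deliberate about where low precision is safe.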

Stanzione expects to see more AI users than HPC users on the TACC systems within the next three years. He predicts other changes too, many moving counter to the changes cloud computing brought about ten years ago. The rise of AI, he said, vindicates the HPC model. AI needs HPC hardware, which is pushing up the cost of GPUs; the market for AI-driven hardware will reach $300 billion by next year, he said.

AI needs fast interconnect, and both latency and bandwidth matter. “I’m betting on low-latency Ultra Ethernet implementations taking over from InfiniBand at some point,” he added. AI needs message passing. “MPI, the message passing interface, was a standard built out of the mostly academic open source HPC community, but now it’s the standard library under all of those transformer-based generative AI tools,” he noted.

And AI needs big, scalable I/O systems and a network to match. There’s no point in letting your GPUs—which Stanzione said now cost many times more per ounce than pure gold—sit idle waiting for data to arrive across the network. “Clearly you should perhaps balance your investment between network and eight-times-as-much-as-gold GPUs.”

NSF’s Leadership Class Computing Facility

With AI calling the shots on hardware and compute architecture, costs will continue to rise, and Stanzione warned that AI innovation is “increasingly locked in very few hands.” That’s where TACC’s open science mandate comes in handy.  

“AI is awfully expensive. Part of the national AI infrastructure—part of our mission—is to sort of democratize this access and make it so that we can make AI innovations in the sciences without requiring hundreds of millions of dollars of infrastructure,” he said. “But we share freely with all of you.”  

TACC is now host to the NSF’s Leadership Class Computing Facility, with construction expected to begin in June and over $1 billion in federal investment over the next decade in systems, people, data centers, and software. The Leadership Class Computing Facility will be the cornerstone of the NSF’s hardware investment in the National Artificial Intelligence Research Resource Pilot Program. The National Science Foundation calls NAIRR “a vision for a shared national research infrastructure for responsible discovery and innovation in AI.”

Stanzione called it the National Science Foundation’s “consistent investment in computing infrastructure that will be around on the multi-decade timescale to match the instruments and the datasets and the things we need to do to continue to move computational science forward.”