Five-Year Plan: The Changing Landscape of Science, Storage
May 18, 2021 | Glenn Lockwood serves as a storage architect at the National Energy Research Scientific Computing Center based at Lawrence Berkeley National Laboratory. In this role, he’s had a front-row seat to the changing demands of science on storage.
“I work in the advanced technologies group, which does a lot of future storage architecture, design, and evaluation work,” Lockwood said. “Prior to that, I was the acting group lead for our storage systems group, which is tasked with actually running the production storage systems and designing more near-term storage. This is not only really fast, really high-performance storage, but also the really stable tiers where people store their data and trust that it will stay there uncorrupted for decades.”
As science changes, what we ask our storage to do changes. And as computing advances, what storage is capable of evolves as well. Lockwood has been watching this growth and working at NERSC to balance the sometimes-conflicting demands of science and technology.
Stan Gloss, founding partner at the BioTeam, recently spoke with Lockwood about this ebb and flow. Bio-IT World was invited to listen in.
Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology.
Glenn Lockwood: I help lead both planning what our center's future storage needs will be in the next five years and executing on the strategy that was envisioned five years ago. This involves evaluating the technologies on which we will rely to meet growing demands from new workloads like cryo-EM while also working with the teams deploying and managing systems now and in the near term on a day-to-day basis. This means we're continually doing design, integration, and retirement across the center's storage: all the way from flash and NVMe down to our giant tape archives.
That's really interesting. How are you now trying to manage storage across multiple generations and formats—from spinning disk to tape to flash? What are the challenges that you see in doing that kind of work?
Because we touch everything from the really fast to the really slow, there is, in my mind, a perpetual tension between risk and innovation on one side and stability on the other. When you're deploying extremely high-performance storage, you can take big risks in technology because you accept that if you're putting something on the floor that is the fastest of its kind, you'll probably hit problems since you're the first to do it. As a result, we discourage our users from storing the only copy of their most critical data on the all-NVMe file system. For our tape archive, though, there's an expectation that if they put their data in there, it will be safe forever.
And so there are questions. How much risk do we want to take in using the more advanced features of our super-reliable tape archive, for example? How much performance on our extremely high-performance flash tier are we willing to give up to give it just a little bit more resiliency in case something does fail? What's our appetite for a risk of data loss? Will we take 50% fewer IOPS or 50% less bandwidth if it means we can replicate all the data on that tier? What are we willing to give up and where do we put our money in terms of risk versus stability?
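The replication tradeoff raised here can be made concrete with back-of-the-envelope arithmetic. The numbers below are purely illustrative, not NERSC's actual figures; they just show how data protection costs both usable capacity and effective write bandwidth:

```python
# Back-of-the-envelope comparison of two data-protection schemes on the
# same raw flash (illustrative numbers only, not an actual system design).
raw_capacity_tb = 1000      # raw capacity, TB
raw_write_gbs = 100         # aggregate raw write bandwidth, GB/s

schemes = {
    # name: (capacity overhead factor, write amplification)
    "2x replication":     (2.0, 2.0),    # every byte stored and written twice
    "8+2 erasure coding": (1.25, 1.25),  # 2 parity strips per 8 data strips
}

for name, (overhead, amplification) in schemes.items():
    usable = raw_capacity_tb / overhead
    effective = raw_write_gbs / amplification
    print(f"{name}: {usable:.0f} TB usable, {effective:.0f} GB/s effective writes")
```

Full replication halves both usable capacity and effective write bandwidth, which is exactly the "50% less bandwidth" question posed above; erasure coding softens the cost at the price of more complex rebuilds.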
So in managing the whole stack on any given day, we might have to think about two different storage systems with completely different balance points of risk versus performance and innovation. It's a perpetual struggle to figure out the right balance for every single tier within the canonical storage pyramid. Then, taking a step further back: how much money do we want to invest in the fastest tier, the middle tier, and the slowest tier? Do we want to put all our money into delivering a first-of-its-kind, fastest-in-the-world file system? Or do we want to put our money into making sure that users can keep storing their data safely for decades to come, and that that data is indeed going to be there whenever they need it?
We're all operating on a fixed pool of resources. It's always a balancing act of money, performance, resilience, stability, and so on. It's tough. Those are not necessarily technical challenges; they're philosophical questions about how you want to approach managing the full data stack in a center.
How much do you need to understand the use cases of the end users to make decisions like that?
It’s absolutely critical, because there are very concrete design decisions that you can make to give up, say, bandwidth in exchange for a lot more IOPS. Bandwidth is what your large-scale traditional modeling and simulation user wants, whereas IOPS is what the 100 users who are running on 10 to 20 nodes at the same time want.
If you're designing a center for users who only run one or two big jobs on the system at once, you don't really need those IOPS. But if you're running a center like we do at NERSC, where we have 7,000 users all trying to do different things at the same time, designing a system purely for bandwidth just doesn't meet their needs. It makes everyone equally unhappy. You could make them all less unhappy if you design a system that gives up that hero bandwidth number and instead delivers better IOPS. You can do that using different data protection schemes and different media, for example.
I think that's something that the HPC industry has been slow to acknowledge. Supercomputing has historically been about that big number, that hero run. How many flops can your system do? What's your peak IO bandwidth? But in reality, that's becoming less relevant to the workloads that large scale supercomputers are seeing.
This only gets worse when you have the onslaught of things like AI workloads, which, unlike traditional simulation, are no longer just writing one big output from every single compute node at the same time. They're reading in micro batches, they're randomly seeking around to get good sampling when they're doing learning, and they're not write-heavy, they're read-heavy. They're consuming data to produce a much smaller amount of output. These are workloads that are new to traditional HPC, and they require a very different approach to designing parallel storage.
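The contrast between these two access patterns can be sketched in a few lines of Python. The file name, record size, and batch size below are arbitrary illustrative choices, not anything from a real workload:

```python
import os
import random

PATH = "dataset.bin"   # hypothetical dataset file
RECORD = 4096          # bytes per training sample (assumed)
N_RECORDS = 1024

# Traditional simulation-style I/O: one large sequential write (write-heavy).
with open(PATH, "wb") as f:
    f.write(os.urandom(RECORD * N_RECORDS))

# AI training-style I/O: many small reads at random offsets (read-heavy),
# grouped into a "micro-batch" of samples drawn from across the dataset.
random.seed(0)
fd = os.open(PATH, os.O_RDONLY)
batch = [os.pread(fd, RECORD, random.randrange(N_RECORDS) * RECORD)
         for _ in range(8)]
os.close(fd)
os.remove(PATH)

print(len(batch), len(batch[0]))   # 8 samples of 4096 bytes each
```

A storage system tuned only for the first pattern (large sequential streams) can perform badly on the second, which is dominated by small random reads and metadata operations.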
We have a tape archive for that, and it incentivizes the behavior that you've described. And there's something about tape: the odds of the data being reused are not huge. If you do try to get old data out of the tape archive, it might take an hour for it to come back. By then you might have forgotten about that idea and moved on to something else.
I can only speak for myself on this, but I do try to publish data that I think would be useful for people. When I started at NERSC, I didn't have root access; I was the equivalent of any other user, and so I didn't have access to any system telemetry. I didn't know what the systems were doing, but it was my job as a system architect to design systems based on how they're used. And so whenever I'd get my hands on a big chunk of operations data from one of our systems from one of the engineers who took an afternoon to collect it for me, I felt an appreciation for how valuable that was to me and could imagine how valuable it would be to other people.
Over the years, it's become easier for me to get the data that I need myself, but I try not to lose sight of the fact that a lot of the data is probably useful to other people. My personal rule is that if it doesn't hurt anybody to put it out there—it's not too much of a burden in terms of de-identifying or clearing intellectual property issues—I share whatever operational data I can.
Some people have come back to me and said that they have actually picked up and used datasets that I've just thrown over the wall. Not many, so I don't know if it's actually making the world a better place, but as long as it's not a huge amount of effort to keep publishing data, I certainly see a lot of value in it. But there are roadblocks like intellectual property and copyright that can make it a little annoying to share data that you would otherwise just let sit on a file system or in a tape archive. And so unless there's a good forcing function to get you over that barrier and do the legwork to go to the IP office and get a dataset properly licensed, I can see why people wouldn't want to do that, especially in a very fast paced, deadline-driven world, such as academic research.
Putting together a three-to-five-year plan, how do you plan for the evolution in workflows? For example, the point at which AI and machine learning moves beyond its current nascent, mid-range scale. Eventually there'll be a turning point at which it's used broadly.
Yes, a turning point is definitely coming, and we have to take it into consideration. It's very challenging, though, because there's not a lot of precedent for understanding or rationalizing what the ramp-up of AI in HPC might be. But based on how much AI and machine learning applications have grown over the last five years, what proofs-of-concept at scale have shown, and the gap between the average user and those hero-sized, whole-system proofs of concept, I think we can see big changes coming long before they arrive.
That said, NERSC specifically has been in a really advantageous position because our mission isn't only about supporting full system simulation jobs. We've always served a very broad user community; we have 7,000 users, and we've often got hundreds of jobs running on our system at any given time. And so that aggregate workload of a bunch of people doing a bunch of different things all at the same time is not that different from what a large-scale AI workload may look like. We've already been going down this path of building storage systems that are more capable of doing read-heavy, IOPS intensive workloads and serving users who are not always doing things in the traditional way simply because that's been our mission: we provide resources for more than just extreme-scale simulation.
I think we're positioned well, and this makes it easier for us to forecast how different tomorrow will be from today, simply because we're halfway to tomorrow already given our workload diversity and how atypical our I/O workloads are compared to centers that run a small number of very large jobs.
So in making these plans, what interesting technology are you evaluating these days? There are all kinds of new storage systems coming out: VAST and Igneous and all these companies. It seems like everybody's moving to flash storage.
We've been in the NVMe game for at least five years now. Our next big parallel file system, which is coming online at the end of this year, is 30 petabytes of usable all-NVMe Lustre. It's exciting in the sense that it will be the first of its kind at this scale, but I suspect that many HPC centers won't look like ours does at the end of this year. Instead, they'll get there in three to five years as the price of NVMe keeps dropping.
I definitely think that flash will become the norm for anything requiring performance. There's just no reason to use spinning disk for anything that requires IOPS or even bandwidth to an extreme degree in the next three to five years.
Storage class memory—primarily 3D XPoint—is interesting, but it seems like nothing huge is really happening with it outside of two technologies: VAST, which you mentioned, and DAOS, an open-source object and key-value store that's being developed by Intel and deployed with the Argonne exascale system.
I think almost every vendor I talk to about storage class memory says the same thing: they don't really want to jump into the pool until a clear market for it emerges. Intel was out first with it, and they're trying very hard with DAOS to show its benefits. But storage class memory is a single-vendor ecosystem, and it's just not seeing a whole lot of demand because there aren't a lot of reasons to go and buy a ton of it right now.
We at NERSC haven't looked at XPoint or any other non-volatile memory technologies very much because I'm not really sure what we would do with it that others haven't already done. It would take a really strong reason for us to abandon all the traditional storage software that we've known, like Lustre and GPFS, to jump ship and buy into something that requires storage class memory, and that killer use case hasn't appeared yet.
That said, it's pretty exciting.
The numbers, the technology, and the techniques that VAST has introduced are really interesting. The engineering there is really compelling. Relatedly, WekaIO is thinking about parallel storage in a very different way, which is pretty commendable. These companies are not optimizing only for bandwidth, which is what Lustre and GPFS have historically done. They're delivering on IOPS and metadata performance, the things that matter when you are doing anything scale-out that is not a single large-scale simulation. Both are pretty promising.
Editor’s Note: WekaIO is hosting an upcoming Bio-IT World webinar on accelerating data pipelines with nimble storage solutions.
But at the same time, at least to us at DOE, parallel storage is a pretty slow-moving body. Even though these are exciting new technologies, it would take a lot for us to jump into a completely untested and new technology at scale, because Lustre and GPFS today are solid choices in most cases. We're buying 30 petabytes of NVMe, but we're slapping tried-and-true Lustre on top of it because we really need the file system to work when it's delivered. We are a data center, and so if we don't have really solid, well working storage infrastructure, we can't deliver on our mission.
What are some of the major complaints that you hear about storage from the end users?
Every year we send out a user survey, and one of the top complaints overall—for the entire center—is that the file systems are slow. And if you scratch the surface of that, what they really mean is that the file system performance is extremely unpredictable. I see this even in the small amount of logging in that I do. You log in, and the time it takes to actually get to a shell prompt or command line can vary from instantaneous to a minute because some dotfile in your home directory that needs to be read is hanging.
It's the unpredictability, the lack of any quality of service, and IO variability that really frustrates users. And ensuring some kind of quality of service and predictability of IO performance is an underserved aspect of parallel IO performance.
How do you fix that?
This is where you have to make a conscious architectural decision to accept tradeoffs. Historically, the reason quality of service has been nonexistent for these parallel file systems is simply that to get any kind of quality-of-service guarantee from a storage system, you've got to effectively take away performance somehow, so that you have some reserved capacity for the people who really need it. That has not really been an option: anything that detracts from the performance of the file system is not worth doing, because it's a high-performance file system. But I think that mentality is changing.
To make things more predictable, you've got to make everything from client to file system predictable, or you have to have at least some way to control quality of service. That part is relatively easy for us on the clients and compute nodes, because we don't have a lot of multi-tenant clients: our users request a number of nodes, and they're the only people on those nodes. As soon as they leave the compute node to do IO to the file system, though, they're going over the network, and historically, network congestion is a major source of unpredictable IO performance.
Our next supercomputer, Perlmutter, will have Slingshot, which is HPE’s Ethernet-compatible interconnect. Part of what Slingshot delivers is defined qualities of service that allow us to give certain traffic some measure of guaranteed performance. For example, we could put a job's IO traffic in a traffic class that has some guaranteed minimum bandwidth to make sure it doesn't hang when other people are also hammering the network. This sort of network capability will hopefully address some of the unpredictability users face today.
On the actual back-end file system side, disk was historically a big problem area for unpredictable performance because the disk is a mechanically spinning thing. If your data is striped across a bunch of disks and someone else is also using a disk to read or write their data, there's not a whole lot you can do to stop their IO from affecting yours.
Flash eases that quite a bit, because flash delivers a lot more performance overall and is capable of handling much more parallelism thanks to the architecture of NVMe drives and the NVMe protocol. The file systems built on top of these storage media are also adopting quality-of-service controls to an extent. Lustre, for example, has token bucket filtering that gives administrators some knobs to turn so they can restrict how much of the total file system performance any one user can consume under contention.
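The token-bucket idea behind a policy like Lustre's TBF is simple to sketch. The toy below is purely illustrative, not Lustre's actual implementation; the rates and the simulated clock are made-up numbers:

```python
class TokenBucket:
    """Toy token-bucket rate limiter: each admitted request spends one
    token, and tokens refill at a fixed rate up to a burst limit."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate      # tokens (I/O requests) granted per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst
        self.last = now

    def allow(self, now):
        """Admit one I/O request if a token is available at time `now`."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A user throttled to 5 requests/second with a burst allowance of 10:
bucket = TokenBucket(rate=5, burst=10)
first = sum(bucket.allow(now=0.0) for _ in range(50))   # burst absorbs 10
second = sum(bucket.allow(now=1.0) for _ in range(50))  # 1 s of refill admits 5 more
print(first, second)   # 10 5
```

Under contention, an administrator can give each user (or client, or job) its own bucket, so no single workload can consume the file system's entire request rate.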
We're getting there, and I think the hardware and software technologies are both converging into a place where we can actually get true end-to-end quality of service.
In my experience at BioTeam, science and technology actually speak different languages. Scientists speak the language of science: the scientific method which is experimental. IT tends to speak the language of engineering. Do you see that difference? How do you interact with those two groups?
We do this at a couple different levels. We do have a model somewhat like you have at BioTeam, where we have a lot of “fallen scientists,” as [BioTeam consultant Chris Dagdigian] calls them, on staff. We have a number of physicists; I'm a materials scientist; we have mathematicians. We've got people who come from the world of science but, for one reason or another, have moved into the world of computing and information technology. And so at a very fine-grained level, we just have a lot of people across the organization who understand both sides of computing. They understand where the users are coming from and the mindset behind their questions. But they also understand the pure engineering technologists behind the scenes, and the way those technologists view the world, their jobs, and their systems.
From that pool of very diverse people, we also build up the organization such that we effectively have user-facing staff, and hardcore back-end engineering people on different teams. I'm not sure if this is a good idea or a bad idea or if it doesn't make a difference, but it's just kind of naturally how things have fallen out.
Our frontline support team knows a lot of the users through interactions with them, and it makes it a lot easier if the scientist at the other end of the ticket, who's trying to get something done, knows that they're talking to a fellow physicist when they file a ticket with us. That support team also happens to know the engineering staff in the storage group. So they have no problem saying, “This user has this concern, is there any way we can accommodate that?” Or, “Is there anything we can do to fix this?” Or, “Can we make this person's life better by doing something?”
By having a gradient from pure scientist on the user end to pure engineer on the back end at NERSC, we have layers of people at different balance points of engineering and science in between. Some people have their biases towards science and others towards technology, but realistically you can't find individuals who understand the full stack with expert-level knowledge. It simply doesn't happen. So we hire people who can do parts of both and then compose teams in a way that makes a pretty even gradient going all the way from the scientist to the engineer.
It sounds like you traditionally build multidisciplinary teams to attack these kinds of problems.
Yes, absolutely. It's certainly a good idea to try to do that, but it also seems to be something that happens naturally, and it's been working.
It seems like the problems we're being asked to solve now, and the amounts of data needed to solve them, have become extremely complex. The future will probably bring us to a point where, because of that complexity, we're going to need a broad range of disciplines, knowledge, experience, and expertise all on one team. Building these integrative teams is going to be very important for the future. What do you think?
I agree 100% with you. Ultimately things like data science or supercomputers are just tools, and you can't build a tool and expect people to use it or know how to use it productively. You do have to build a tool with users in mind, and if it's a complex tool, you often have to work hand in hand with the person who's using your tool in order to get something great to come out of it.
In some places, there can be a culture of people who have long been the first of their kind to do something, and there's a mentality amongst many staff that everyone can just figure things out, build things themselves, and do the full stack. In some scientific disciplines, arguably, that's still doable. As a physicist, you could build an instrument that's world class without necessarily being a mechanical engineer as well. But computing has gotten sufficiently complicated that I just don't think that's possible anymore.
You can still be very productive as a single person army for sure, but to make the most out of computing, all the way down from the individual hardware components to system software to the data analytics tools that are built on top of that, it's really difficult to do it by oneself. You really do need to have a diverse team of people who have skills that are complementary but not completely overlapping.