Eli Dart On Science DMZs, COVID-19, And The Future Of Computing Infrastructure
May 11, 2020 | TRENDS FROM THE TRENCHES—At the Energy Sciences Network—ESnet—Eli Dart works to help scientists, science collaborations, and facilities more effectively use high performance networking in the conduct of science. It’s a job he loves, he says, allowing him to focus on architecting and designing high performance science infrastructure, conducting problem analysis, and developing design patterns for best practices.
Stan Gloss, founding partner at the BioTeam, recently spoke with Dart about his career in network development and the concept of the Science DMZ, which Dart introduced in 2011; Bio-IT World was invited to listen in.
Today the Science DMZ is a foundation on which we can build better networks, Dart said. The fundamental goals still apply: a network architecture explicitly designed for high-performance applications, where the science network is distinct from the general-purpose network; dedicated systems for data transfer; performance measurement and network testing systems that are regularly used to characterize the network and are available for troubleshooting; and security policies and enforcement mechanisms that are tailored for high performance science environments.
But Dart isn’t territorial. “If somebody can come along with something better, let's all jump to that!”
Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology. You can watch a video of Stan and Eli’s full conversation at http://biote.am/dmz.
Stan Gloss: My first interaction with you was a number of years ago, around a concept called the Science DMZ. How did we get to a Science DMZ? And what drove us to get there, Eli? Why did you have to design that?
Eli Dart: The roots of the Science DMZ model came out of two things. One was my work as one of the network engineers for NERSC, which is the Department of Energy’s flagship high performance computing facility. I worked there from 2001 to 2005. My colleague Brent and I spent a bunch of time and effort thinking about how the internet protocol stack mapped on to hardware, and what we needed from hardware in order for those protocols to run consistently well, and consistently provide high performance service to the applications that used them. And so, we rebuilt the NERSC network from the ground up, with that performance engineering in mind. And we didn't just build it for performance, because, as we all know, there's an operational aspect to this. It has to not just be possible for it to work well; it has to work well in the general case, without a lot of human babysitting involved. I've got better things to do than fix something every time I turn around; we all do. And so there are really two pieces to that. One is it has to be able to perform well. And the other is it has to be able to perform well in the general case, without a lot of human input.
When I moved from NERSC to ESnet, that was going to the other side of the fence. ESnet is NERSC's network provider. So, I went over to the other side, and saw the issues that we were resolving at NERSC writ large throughout the scientific community. And as part of my early work at ESnet, I conducted a large scale requirements analysis that eventually grew into ESnet's formal requirements program, where we looked at the network needs of a bunch of different science projects, facilities, and disciplines. And we came up with several common elements. That commonality, combined with my experience at NERSC, made it clear that this was a generalized problem that was only going to be solved in a reasonable way if everyone took similar action.
That's where the idea of a design pattern comes in. Most people in technology are familiar with the notion of a design pattern from a series of programming books written a number of years ago. But the idea of a design pattern is: there's this set of design elements that hold true, regardless of detail. And you end up with this well-configured set of Lego blocks that you can use to assemble a solution. But the solution has these common themes throughout it, and the themes are things that perform well, or operate well, or behave well, in the general case.
In the Science DMZ case, it's providing a loss free IP layer, so that TCP performs well. It's a way of connecting a high performance data transfer host to that infrastructure. It's a way of securing that enclave effectively. It's a way of assessing, and assuring the performance of that enclave, and being able to troubleshoot it quickly, and isolate problems quickly if they do occur. And guess what? They do, because networks are complicated. It needs to be able to operate well in the general case, not just operate well once.
The Science DMZ is a design pattern encompassing those elements. It can be deployed effectively for a small enclave at a small university, all the way up to an LHC Tier One, where you're literally just throwing petabytes of data on the wire to a supercomputer center. Lots of different environments.
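The insistence on a loss-free IP layer follows from how TCP's congestion control reacts to packet loss. The Mathis model gives a rough upper bound on single-stream TCP throughput as a function of loss rate; the sketch below (with illustrative parameter values, not measurements from any particular network) shows why even tiny loss rates cripple long-distance transfers:

```python
import math

def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate single-stream TCP throughput bound (Mathis model):
    rate <= (MSS / RTT) * (C / sqrt(p)), with C ~ sqrt(3/2) ~ 1.22."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss_rate))

# Cross-country path: 1460-byte MSS, 50 ms RTT (illustrative values).
for loss in (1e-6, 1e-4, 1e-2):
    gbps = mathis_throughput_bps(1460, 0.05, loss) / 1e9
    print(f"loss {loss:.0e}: ~{gbps:.4f} Gbit/s")
```

On this hypothetical path, one loss event per million packets bounds a single stream at a few hundred megabits per second, and one percent loss collapses it to a few megabits, which is why the Science DMZ treats packet loss as a defect to be engineered out rather than an expected condition.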
What were the several things you mentioned that you found were common across disciplines?
So there were a couple of things that were pretty clear across science disciplines. One is more and more science disciplines were getting a hockey stick in their data curve. This is because they're either deploying new detectors, or new sensors, or they're deploying improved computing algorithms that allow them to do things they were never able to do before. They're using high performance computing for the first time, or maybe they're using a new generation of machine that allows them to use a much finer grid mesh, or something like that, and datasets explode when you do that.
But me, personally, I do not scale exponentially. What we need is a way for the tools that people use to learn things and understand things to be able to scale with the datasets that they're using, and that they're working with.
Related to that, people always talk about either you move the compute to the data, or move the data to the compute. What we learned early is that we have to support both in the general case. Yes, you move the compute to the data where you can, obviously; it's by far the most efficient thing to do.
But if your proposal for however many million hours of CPU time on a national supercomputer center is accepted because it's meritorious, you get an allocation of storage and compute on a named resource. If your data is not available to an analysis code executing on that named resource, the first thing you have to do is get your data there. If that named resource is the Summit supercomputer at Oak Ridge, or the Cori supercomputer at NERSC, or Frontera at TACC, you can't move the compute to the data, because you can't move the compute.
This combination of factors made it clear that, in order for the rest of the science complex to be productive, we as the networking community had to come up with a way to enable scientists to productively do arbitrary data placement at scale. Because that's the only way they're going to be effective in extracting the scientific promise from all of these different things that they have. They need to assemble those things into a coherent whole, which they can drive as one distributed machine. Because that's the only way they're going to solve these really large scale science problems.
Right. We've implemented a Science DMZ both at the NIH, and also USDA. For example, at USDA there are researchers all over the network doing cow genomes. You don’t need a cow genome in every location, you just need quick access to the data that is in the system. There are some kinds of efficiencies we've gained, not only around the networking and the movement of data, but the storage, and efficiency of that.
Yes. So, I think you're touching on a really valuable point here. If I can easily get the data from a data store, I can pull the data down to scratch space. I can run it through my analysis function, and extract what is, in most cases, a much smaller, higher value set of data products from that initial analysis. I can now delete my copy of the raw data, and because the custodial repository still has it, I can get it back whenever I need to. I can now proceed forward, working with the smaller, higher value data products that take less resource to store locally.
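That pull-analyze-delete pattern can be sketched in a few lines. Everything here is hypothetical: the copy step stands in for whatever high-performance transfer tool the site uses, and the analysis is a trivial placeholder. The point is the shape of the workflow: stage raw data to scratch, derive a smaller product, then discard the local raw copy, because the custodial repository still holds it.

```python
import shutil
from pathlib import Path

def analyze(raw: Path) -> bytes:
    # Stand-in analysis: reduce the raw data to a small summary product.
    data = raw.read_bytes()
    return f"bytes={len(data)}\n".encode()

def stage_and_reduce(repo_copy: Path, scratch: Path) -> Path:
    """Pull raw data to scratch, extract a derived product, delete the raw copy."""
    local_raw = scratch / repo_copy.name
    shutil.copy2(repo_copy, local_raw)       # stand-in for a high-performance transfer
    product = scratch / (repo_copy.stem + ".summary")
    product.write_bytes(analyze(local_raw))  # smaller, higher-value data product
    local_raw.unlink()                       # safe: the repository of record keeps the raw data
    return product
```

The deletion in the last step is only safe because the custodial copy survives; that is exactly the guarantee that makes the no-hoarding behavior Dart describes possible.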
Now, the side effect of this is, as a researcher, I don't have to hoard data on my local storage system, because I know I can get it from the repository of record, on human reasonable timescales, should I ever need to do that analysis, or a related analysis again. In environments where it's straightforward to do that, people don't have to hoard their own copies of their own data. In environments where it's very difficult to assemble a dataset, yeah people are not going to give that data up. And so, you end up with copies of this data taking up space on file systems, or storage systems, scattered throughout the science complex. That's not good for anybody.
This idea of making centralized data resources easily available to compute, wherever the compute may be, in whatever way is efficient for the scientists, that's a key enabler for research productivity.
Say I'm one of those USDA researchers that wants to get access to that cow genome somewhere. Maybe I'm in Omaha, Nebraska, and there's data in Albany, California, and I want to be able to access that. So how does the Science DMZ help me do that, in non-technical terms?
Researchers shouldn't have to be exposed to the sharp edges of technology in order for them to be productive. The Science DMZ would allow you to have an easy-to-navigate interface to a high-performance tool that allowed you to place the data that you need next to the compute you're going to use for analysis. And here's the data portal where you can find that data, and this is the mechanism by which the data will come out of that portal, on to the storage system that's next to your compute.
Now, I would go and tell the System Administrator: for the back end system that the scientist will use, you need to have a high performance storage system, front-ended by some data transfer nodes (DTNs), and the exterior-facing interfaces of those DTNs (not the interior-facing ones) go in the Science DMZ. The data portal software that you use for search and discovery in that storage system has to be coupled to those things in a way that makes it easy for the scientist to access them.
Next, I would tell the Network Engineer: the IP layer between those DTN interfaces, and whatever the DTN is on the other side, which is next to that scientist's computing resource has to be loss free, and you, Network Engineer, need to have mechanisms to ensure that it is loss free. If it's not, you need to find out before the user does and localize the problem, so you know which administrative domain actually owns the poorly performing pieces, and you can fix it.
Then, I would tell the CIO: this stack, this set of things, is going to make your science users more productive. Instead of being a cost center, like so much of IT, this is an enabler, an accelerant for the primary mission of the organization, which is scientific discovery. And being able to operate this effectively, and operate it safely, which means sane security policies, is really important in order to keep this research organization on the cutting edge going forward.
Finally, I would tell the security engineer: there's a set of applications that have to run here, and you need only those applications to run here. If anybody tries to run anything else, lock them down, and shut them off. Because the DTN is a specialized device, it should only run certain tools and protocols, it should speak that API, and nothing else. The Science DMZ is one coherent thing, but different people have different missions to fulfill, and different portions of the space that they have to cover, in order for all this to work together as a coherent whole, to make the scientists more productive.
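As a concrete (and deliberately simplified) illustration of that "only these applications" posture, a host-level policy for a DTN typically permits the transfer tool's control and data channels from known collaborator networks and drops everything else. The port numbers and address ranges below are assumptions for illustration only; real deployments take them from their transfer tool's documentation and their collaborators' registered prefixes:

```shell
# Hypothetical iptables policy for a DTN: default-deny, with narrow allowances.
# 198.51.100.0/24 stands in for a collaborator network; ports are illustrative.
iptables -P INPUT DROP
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -s 198.51.100.0/24 --dport 2811 -j ACCEPT          # transfer control channel
iptables -A INPUT -p tcp -s 198.51.100.0/24 --dport 50000:51000 -j ACCEPT   # transfer data channels
iptables -A INPUT -p tcp -s 203.0.113.10 --dport 22 -j ACCEPT               # admin SSH from one management host
```

Because the DTN's traffic profile is this simple, an equivalent policy can also be enforced at line rate in router ACLs, which is how Science DMZs avoid pushing multi-gigabit science flows through stateful general-purpose firewalls.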
Is what you described the current state of the art of Science DMZ? Is this the current thinking?
At this point, I think of the Science DMZ as a foundation and we're looking to build on top of that foundation. People are realizing that they need to collaborate, and they need to share data.
The vocabulary we use is data portals, Science Gateways, or Data Commons. It's a set of data, or a repository of data, along with its metadata, and the scientists can search those metadata elements to figure out what data sets they want to work with. Whatever suits their analysis, or their project, or their line of discovery, and then access those elements that make sense for them to access.
There are a couple of examples of this. For Earth Science, there's the Research Data Archive at NCAR, which has literally petabytes of data, and there's a nice search interface. It's backed by a Science DMZ, so when the web browser hands you a pointer to those datasets, you have the option of using a high-performance tool to place those data sets at a high-performance computing facility, for example. Because I tell you what, downloading 50 terabytes to my laptop over my home Wi-Fi is not a useful exercise. Taking that 50 terabytes and placing them on scratch space at a high-performance computing center, where I have millions of hours of CPU time, that's great.
On top of this foundation layer of the Science DMZ, we build larger structures that are geared toward scientific productivity. And the hope is that if the systems, and networking people have done their job well, integrating those data discovery, and data publication functions with the Science DMZ, the scientists should never have to know. And all that they know is that that portal goes fast, and that's the productive one to use. So that's the one I'm going to use to get the datasets out, and put them over here.
What piece of the Science DMZ is most exciting to you moving forward in that model?
To me, the thing that's really exciting in that model is the idea of being able to interconnect three things. The data that the scientist wants to analyze, the computing that they're going to use to conduct that data analysis, and the scientists themselves. If all you have is data and computing, you don't have knowledge. Those things have to work together, at the behest of the scientist who is using this distributed machine to ask the next question, and to build human knowledge and carry it forward.
The Science DMZ is just the best version of this we've got right now. If somebody can come along with something better, let's all jump to that. I don't claim that we're done, but it's good enough that we can build this next layer of things on top of it. And so, what I hope to have is a future where scientists have an easy to use interface to computing, and they have an easy to use interface to data, and they have an easy to use tool chain that allows them to fuse those two things together in a way that makes them productive, regardless of data scale.
I think the Science DMZ model is an enabler for that. But the Science DMZ model cannot do that by itself, because what the Science DMZ model gives us is the ability to do high performance data placement. It doesn't give us those higher level things, that are domain specific, or the property of a data hub, or the property of a Data Portal, or a high performance computing code, or all the other things that a scientist will use. It's just a high performance foundation upon which we can build those things.
We all know that something's going to change. The idea is to get the architecture right, so that you can change out pieces as you need to. If you come up with a better data transfer tool, you ought to be able to change that out. You come up with a better storage system, you'd better be able to change that out. You come up with a new computing paradigm, whether it's quantum, or whatever the accelerator is, after GPUs, or the new spiffy GPU, whatever it is, you need to be able to change that out.
But the human structures around this larger framework, where we've done the right problem decomposition that allows us to scale the pieces we need to scale as the datasets increase and as the computing capability increases, are what's important. What matters is getting this set of constructs harmonized in a way that lets us scale them up.
The quip I like is, "Technology changes quickly, and people and protocols change slowly." People take a long time to change. And the TCP/IP protocol stack has been with us for decades; it's unlikely to go away. So what we need to do is to figure out a way to productively use the protocols we have in a rapidly changing technology environment to make people as productive as we can.
What you're describing is almost like an infrastructure ecosystem that's modular. You can change out parts as technology moves around; you're not wedded to any one technology. As long as you've had good data structures and good data hygiene all along, all of that should move with whatever technology comes next. But we're also talking about using the portals and hubs as ways to get the people on the other side collaborating and working together. Is that where we're going in the future?
I think so. As we get better at technology, the speed at which the technological portions of a workflow will happen will increase. The gating factor is going to be the human.
We're very much seeing this rise of interdisciplinary science teams, for example. And so, a bunch of that is an admission of, or a salute to, the fact that many problems are more difficult than one person can solve. And so, if we can make it easy for people to each bring their own expertise to the analysis of a dataset, or to the formulation of a pipeline to solve a problem, or whatever it happens to be, that's where we will be very, very productive. And this is straight out of the world that I live in, where big team science is one of the hallmarks of the DOE National Laboratories. One of the reasons the labs exist is to be able to bring big, interdisciplinary teams together to tackle big problems.
And so, this idea of making teams of people productive, by making the technological parts friction free, is, I think, hugely valuable. And it's a huge competitive advantage, I think, to any organization that can get that right: people can spend their time collaborating with each other, and working on the science, rather than hassling with the technology all the time.
Sure. That's part of the reason people have gone to the cloud. They wanted some of that hassle-free stuff. Right now we're seeing that level of collaboration in the middle of this COVID-19 pandemic. We're seeing collaboration on scales that we haven't seen. They might be effective, they might not be effective, but they're definitely happening today, and every day now, at a scale that we've never seen. Do you think it's going to stick? We have a common foe today, but could it go back? Could we go back to our silos? Or did this completely blow up what we're doing, and this is going to create a significant change in mindset?
I think that's a really good question. I mean, I could be flippant and say it's difficult to make predictions, especially about the future. But I think it really is worth thinking about, because it has some implications for the technological environment that we all work in. When all the dust settles, and we finally have a therapeutic, and we finally have a vaccine, and the darkness of this can be put behind us, I think many scientists and many disciplines can look at what happened and try to learn from it.
I think one of the things that we're going to see happen, when we look back, one of those things will be, as you describe, huge numbers of people coming together, using technology in a way that allows them to be productive quickly. That's going to stick, I think. And I think this is going to be a real wake up call to some people who hadn't necessarily thought about it before, of the value of technology enabled rapid collaboration.
Some groups are going to harness that very effectively. I think there are going to be some groups that want the protection, and exclusivity that comes from a silo. And I think it's going to be an open question as to whether you need the silo to provide that isolation, and exclusivity, or whether we can apply effective policy controls, say embargo before publication, or something like that.
I think there's going to be a question of, can we take the obvious wins from these rapidly assembling big teams that use technology to be effective quickly, and use that same technological substrate to provide individual investigators with what they need, in a way that still allows the larger community to make use of their data when it's appropriate to do so. And that would be my hope. My hope would be that we could all come together in a way that made that easy enough to do, that we could all see the value from it.
I'd like you to go back, say 10 or 15 years in your career, and think about advising yourself with the knowledge that you have today. What would you have told yourself 10 years ago, to prepare for what you now know to be true because of your experience?
Let's see. I think that one of the important things that I have learned is that the interaction between people and technology is critically important. When we're very technology focused, which my younger self very much was, it's easy to get lost in the cool factor of the technology, because a lot of it is really cool. And there's nothing wrong with that at all, because people who get lost in that, tend to excel at it in a lot of cases, and they build systems that are amazing.
But the real value in those systems is what people can do with them. And the relevance of those systems is in the context of people. One of the big growth areas for me has been the full realization of just how much the people matter in this, and the relationships between people, the relationships of people to their data, or to their science. Those relationships drive how technology is used, and therefore how technology is effective in the ways that actually matter, which is how it matters to people.