Hints On Data Storage And Transport From High Energy Physics

March 2, 2020 | It may be surprising that with a PhD in biochemistry, Brigitte Raumann spends a lot of time these days considering data storage and transport in high energy physics and astronomy. But that also puts her in the perfect position to recognize some of the lessons from those disciplines that the life sciences can pick up.

Raumann is the product manager for Globus, a research data management department of the University of Chicago. As she has worked on data storage and transport issues for many disciplines, she sees opportunity for the life sciences in automation, identifying—and removing—the mundane tasks that are still bottlenecks to research. Researchers should not be solving IT challenges, but should be solving bigger biomedical challenges, Raumann emphasizes.

Bio-IT World sat down with Raumann to talk about the challenges she sees in data management in the life sciences and the solutions available.

Editor’s Note: Raumann will be speaking at the Bio-IT World Conference & Expo, April 21-23 in Boston, during the Data Storage and Transport program.

Bio-IT World: Globus was a Best of Show finalist at Bio-IT World last year. Can you tell me a bit about the service and its goals and how it works?

Brigitte Raumann: Globus is a service that provides secure, reliable research data management tools; by that I mean services to move, share, and find your data. Of course, researchers have their data in all sorts of different locations. They use anything they can get their hands on, right? They've got it on a server under their desk or in the lab, they've got it on their laptop, they might have it at a supercomputing center, they might have it at the sequencing facility, or at a collaborator’s site. And all of that data is on different storage platforms. You need different accounts to access it. You need to know different technologies because some of the data's in the cloud, right? Some is on your laptop.

Globus provides an overlay across all of that data so that you can see it and manage it with one interface. We provide a web interface through which you can manage that data. You can use your institutional login to manage the data, and it's secure and very easy to use.

We are wholly part of the University of Chicago. So as a project at the University of Chicago, we're nonprofit of course. And we serve the nonprofit research market. That's really our intention, to advance science by letting researchers do more research and worry less about the IT behind managing big data.

Globus grew out of a need in the physical sciences to manage very large data sets well before many biologists moved into big data with the genomic era. But the Globus platform is generic enough that you can apply it really to any kind of data. We're seeing more and more adoption in biomedical sciences because, of course, biologists are struggling with large datasets just like anyone else right now. And it's particularly well-suited for the biology market because individual research labs don't tend to have the IT support that a big high energy physics project like CERN or some of the large telescope facilities would have. They have a lot of IT support in the physical sciences, whereas in the biological sciences, those resources just aren't there.

So people need something that is very user-friendly and easy to use and that doesn't take a lot to maintain or install or operate. That's one reason why we operate Globus as software-as-a-service.

Is storing the data, moving data, is that really a bottleneck or a pain point to scientific progress in the life sciences?

It is! It's so mundane and you wouldn't think it is, but in fact it is a big struggle and it really shouldn't be, right? It is not exciting or sexy and really is not what researchers should be thinking about at all. But it is a bottleneck. In part because of the large size, you can't email it, you can't use HTTP to upload or download it. A lot of the people who use Globus, if they weren't using Globus, they'd send the data on a hard drive. And that's problematic because physical media is hard to track, it's not clear if the data's been corrupted by the time you get it, and it's not even always clear that what you received on the hard drive is everything that you were supposed to receive. Globus takes care of all of that tracking. It checks file integrity after the transfer, and I would argue it provides more security than shipping a hard drive does. Globus also meets HIPAA security standards, which is a big advantage to many users in biomedical research.
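To make the integrity and completeness checks mentioned above concrete, here is a minimal Python sketch of the general idea: compare a checksum of every file at the source with its copy at the destination, and report anything missing or altered. This is an illustration only and not how Globus itself is implemented; the function names `sha256_of` and `verify_transfer` are hypothetical, and it assumes both directory trees are readable from the same machine.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_dir, dest_dir):
    """Return (missing, corrupted): relative paths absent from dest_dir,
    and relative paths whose contents differ from the source copy.

    Hypothetical helper for illustration; not part of any Globus API.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    missing, corrupted = [], []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = dest_dir / rel
        if not dst.is_file():
            missing.append(rel.as_posix())      # never arrived
        elif sha256_of(src) != sha256_of(dst):
            corrupted.append(rel.as_posix())    # arrived, but bytes differ
    return missing, corrupted
```

A recipient of a shipped hard drive rarely has the source tree to compare against, which is why the sender would normally publish a manifest of checksums alongside the data; the comparison logic stays the same.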

Data management is often difficult because you're moving data between different types of systems, right? You want to move your data from Amazon to your Linux cluster. How you would actually do that is not clear to an average biologist. It is not point and click; it's not drag and drop. But it can be if you use Globus. So it is, unfortunately, a bottleneck and our goal is to remove the bottleneck and shield the researcher from the complexities of managing data across different storage platforms.

Cloud storage is sometimes portrayed as point-and-click. Do you find that to be true? If not, what sort of new challenges do we have because of this mix of on-premises and cloud storage?

From my anecdotal experience, I think there's a learning curve for a biologist to adopt cloud resources for sure. Even if it's easy to use once you're in the cloud—and that's debatable, but let's assume it is—you still need to get your data in and out because the data is generally not produced in the cloud, right? It's generally data that's coming off of, for example, an instrument like in the case of images and sequence data, or from electronic medical records. And then of course, if you're doing your analysis and storage in the cloud, eventually it's got to come out of the cloud or at least move within cloud storage systems. Either to get published into data repositories, public repositories, data portals, published in journals, or sent to collaborators, what have you. So it's not a place where your data is going to stay. You need to get it in and out.

And don’t forget the cloud usually means a researcher needs to create new identities and learn new ways of managing permissions and security requirements. And an average researcher is already juggling 10 different accounts, right? They have their account for the hospital they’re affiliated with, at the university they work at, they've got their Google accounts, they've got the temporary account on a collaborator’s storage, and now they've got their Amazon accounts. That's just one more mundane thing for them to manage. With Globus you don't need to worry about those sorts of things. So I would say I'm all for the cloud and I think it's a great resource, but Globus tries to make using the cloud easier and come closer to that promise of point and click for the cloud.

You mentioned that actually Globus started off serving primarily the physical sciences. Have you made any observations or learnings from the physical sciences that we need to be adopting or translating or transferring over to life sciences?

That's a good question. I’ve noticed a lot more automation in the physical sciences in terms of moving the data, archiving the data, processing the data, distributing the data. A lot of projects use Globus just for that very reason, because you can use the API or command line interface to integrate Globus into your workflow. I don't see quite that kind of adoption in biology as often. And so I think that's a place where we can really help biologists to apply Globus in order to automate some of those more mundane tasks. For instance, at sequencing facilities, automate archiving data after a certain time frame or archiving it after you've detected that a customer has moved it off of or copied it out of the sequencing facility onto their own resources. Or automate the creation of different folders with the specific permissions that users need.
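As a rough illustration of the "archive after a certain time frame" idea, the sketch below moves run folders that have not been modified for a given number of days into an archive location. It is a generic, local-filesystem example and does not use the Globus API or command line interface; the function names, the one-folder-per-run layout, and the use of folder modification time as the age signal are all assumptions made for the example.

```python
import shutil
import time
from pathlib import Path


def find_stale_runs(data_dir, max_age_days, now=None):
    """List run folders directly under data_dir older than max_age_days.

    Age is judged by the folder's modification time, which is an assumption:
    it changes when entries are added to or removed from the folder itself.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_dir() and p.stat().st_mtime < cutoff
    )


def archive_stale_runs(data_dir, archive_dir, max_age_days, now=None):
    """Move stale run folders into archive_dir; return the names moved.

    archive_dir is assumed to live outside data_dir.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for run in find_stale_runs(data_dir, max_age_days, now):
        shutil.move(str(run), str(archive_dir / run.name))
        moved.append(run.name)
    return moved
```

Run on a schedule (for example, nightly), a script along these lines takes a recurring manual chore off a facility's staff; a production version would also confirm that the customer has already copied the data before archiving it.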

In some ways it's a little harder to automate in biology because there really aren't the universally accepted processing pipelines that exist for some of the physical sciences. The tools aren't as standardized. There are general best practices and some tools have greater adoption than others, but everyone tweaks their pipelines slightly or uses similar but slightly different tools or comes up with new tools. I think that's both the nature of biological data and where biomedical research is today. You have to account for a lot of exceptions and variations in experimental design and in the samples and in the different sample preparation technologies. And the technologies that generate data, such as sequencing and imaging technologies, change rapidly in biology. That’s one of the reasons biologists need flexible tools that can be applied to a wide variety of data management challenges.