Learning the Language of Data Science and Pushing Science Forward
August 11, 2020 | As Susan Gregurick moved from academia to the Department of Energy and now to the National Institutes of Health, her views of data and data science have grown. Now, more than ever, she is convinced that data science is not just about technology but also about the people who will readily use and reuse data.
“My biggest advice is to learn better communication with people about data science, so that they can see themselves in the solution that one is thinking about,” she told BioTeam founding partner Stan Gloss. A data science plan—like the one Gregurick recently launched at NIH—will not succeed without investment and buy-in from researchers. “You have to meet people where they are in data science. It's taken me a lifetime to try to learn to talk to people about data science in ways that they want to meaningfully engage with… They want to see themselves in the data solution,” she said.
Gregurick spoke with Gloss recently about her developing views of data science and what she’s learned over her career—and in the past 4 months—about how data can help us push science forward. Bio-IT World was invited to listen in.
In our current COVID-19 reality, Gregurick sees an opportunity for a leap forward in our thinking about data sharing. “I think that we have an opportunity to make radical changes in the way that we share, and the way that we closely hold data,” she said. “I don't want to belittle the fact that we have data privacy concerns, but we have an opportunity now to make a culture change that we should not waste.”
But that, too, will take conversations and listening along with technical advances in privacy.
Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology.
Stan Gloss: I'd love to know a little bit about your background. Prior to NIH, I know you spent a little time at DOE. Tell me about that.
Susan Gregurick: I came to DOE from academia. I was a professor of computational biology and biophysics at UMBC and looking to do something a little bit bigger and more impactful. I was at a review at DOE, and they asked if I'd be interested in working on computational data science within their organization. I tried it out, and it was just so wonderful. I was hired in as a program manager in the Division of Biological and Environmental Research, or BER, within the Office of Science. My role was to be the program director for computational sciences within BER, but I mostly focused on the bioenergy sector, in the Bioenergy Research Centers that we had funded just as I arrived.
I got to see the launch of those centers. I worked a lot on the computational needs. I worked with the Joint Genome Institute in developing data solutions and computational solutions for JGI. Then, I was asked to think strategically about the integration of data and computing for DOE. That led me to work with the Office of Advanced Scientific Computing Research at DOE and to stand up the DOE Systems Biology Knowledgebase, KBase.
Wow. How long ago was that?
Oh my goodness. I think it was in 2007.
That was a long time ago to be thinking strategically about data. People were focused so myopically on primary use of data, without the knowledge and thought about the secondary use of data. KBase was really quite ahead of its time when you did it, right?
I think it was. That's the beauty of working with the Department of Energy and colleagues at the DOE National Laboratories; they tend to be really ahead of their time in scientific areas. They have a good understanding of the use of data through their involvement in climate sciences and other sciences, such as high energy physics and chemistry. Thus, the conundrum was, "Well, what about biology?"
Biology—the data is so very diverse and coming from so many different sectors that it presented a really unique, grand challenge for the Department of Energy. There were some leading examples that we learned from. There was an example from NIH, caBIG, which we can all think about for good and for bad. There were examples happening at NSF; at that time the big data infrastructure was called iPlant and so we looked a little bit at that. There were some emerging data science initiatives in the UK around Elixir, which is now pretty robust, but at that time it was just in the planning phase.
We had some sister organizations thinking along the same lines: How do we bring really large data infrastructures to researchers in a way that is useful for them? That's really the thought of secondary science, and re-use of data. At that time, there really wasn't the concept, in my mind at least, of FAIR: Findable, Accessible, Interoperable, Reusable.
Would KBase be considered a science gateway type of product?
Yeah. I think in some way, one could consider it a science gateway. That was certainly how I thought of it, and how we envisioned it. It presented some really unique challenges, and some amazing opportunities, and part of those opportunities was really the work that we did with Eli Dart and ESnet to get large-scale data transfer across the country, faster, from Chicago to Berkeley. I was totally blown away with what Eli and ESnet could offer and could do because of our push to really think about data as a high-capacity research resource.
You're kind of like a hybrid. You have quite a background in science, but you're also very well grounded in technology. How has that been for you?
It's been kind of exciting. I've always been a little bit of a closet geek, especially when it comes to computing, and computing infrastructure and chips. I'm by no means a CS expert or technology expert; I'm just sort of a geeky person who really likes to learn about these things and that has helped me scientifically to a great extent.
So now you are Associate Director for Data Science at NIH. What part of your DOE experience do you think was most helpful as you moved to NIH?
Oh, that is a great question because the most impactful things that I did at DOE always had to do with working with a team of excellent people. DOE is really team-focused, and I learned the value of diversity and diverse thinking. Basically, we were all program managers focused on working together to get solutions to really hard problems. That is what I brought to NIH.
I certainly learned a lot of technology when I worked at DOE. I met some really smart, really clever people, and I learned from them, but I also learned how to work in a much more team-focused environment. That's not necessarily the academic mindset. Certainly, that is changing, but when I was in academics, team-focused science was not necessarily the number one priority. I think that there's real value in it, when it's appropriate.
What roles have you had at NIH?
It's really fun to work in any IC [Institutes and Centers] in NIH. There are 27 ICs, and the culture, the mission, the goals, of all of them are different. My experience from the National Institute of General Medical Sciences will just be an experience at NIGMS. It couldn't necessarily be extrapolated to another IC.
I first took a role as a Division Director for biophysics, biomedical technology, and computational biosciences, and I led a team of approximately 12 to 13 program directors and one support staff person to fund just under $500 million of awards a year. With that level of funding in one division, it's a machine. Processing the funding of awards is a machine that is incredibly well worked out at NIGMS; it is probably the number one R01-granting institution at NIH. It's wonderful to work and to make changes in science while still maintaining that investigator-initiated, let-1,000-flowers-bloom culture. That is what I learned, and that is why I went there. I wanted to move science forward in ways that would be impactful for very, very many researchers.
I did it in three areas: in technology, in team science, and in modeling infectious disease agents. I never was able to do a big push in computational sciences because I started to work on the strategic plan for data science and really moving data science forward at NIH in meaningful, impactful ways, for many different types of initiatives and programs. Again, my love for data science! It's hard to keep away!
Whenever you look at successful projects, you can always track back and find a successful team. Now you've made that transition to your new role. What's the new job responsibility? It's actually pretty big, right?
It's pretty big, and it gets bigger every day. Even though I am an insider at NIH and I've been here for almost seven years, it still feels overwhelming, like drinking from a fire hose. I have a vision; I have a strategic plan; I have a guiding light; I know what I want to do. Getting a handle on data science at NIH is incredibly challenging, especially for us. We're a tiny little office, so we have to rely on our colleagues to help and to work with us. It's a lot of work. It's not an easy job, that's for sure. Hopefully it's becoming impactful for the community in ways they might not have necessarily expected.
What was the state of data science? Where was it at when you all of a sudden were handed the reins to do it?
I would say that, although it may not have initially seemed so, I think that the foundation of data science at NIH was actually pretty strong.
We had two very different leaders in data science. Well before my tenure, Philip Bourne brought to NIH the idea that data is incredibly important, incredibly valuable, and it needs to be treated like a research resource. It needs to be used, and to use it, you have to find it, and you have to be able to access it. It has to be reusable, and then hopefully, interoperable, so he brought FAIR to us in meaningful ways. If he hadn't come to NIH I don’t think we would have been where we are today. He made a really meaningful contribution. We had a lot of funding activities, and a lot of good science got done, but the most impactful thing that happened was the concept of FAIR as a principle that we can actually implement.
Great. What was the other? There was a second.
Vivien Bonazzi worked really immensely hard on bringing the concept of new types of data infrastructure to NIH, and that was foreseen in her work in the NIH pilot called the Data Commons. Because of all that work that she put in, we have the AnVIL (Analysis, Visualization, and Informatics Lab-space), we have the BioData Catalyst, and we've been able to leverage and use activities in the NCI Cancer Research Data Commons framework, CRDC. Other ICs are now implementing their versions of data infrastructure commons platforms, because of the groundwork that she laid. Those commons ideas were hers, and she really built the infrastructure at NIH in incredibly meaningful and impactful ways.
Since you’ve started, you’ve put forth a strategic plan. Tell me about the plan.
I asked a number of leaders at NIH to think about what it takes to create a data ecosystem that's as encompassing as data can be. The strategic plan has five parts:
- Infrastructure, utilizing cloud computing and high-performance computing to the best of our abilities;
- Data ecosystem, thinking about how we actually make datasets FAIR in either community repositories or generalist repositories;
- Tools and analytics, either AI or other types of tools, the software that's developed to support those tools, and the people that develop the software;
- Engagement, connecting a broader community of people—citizens and scientists—who are interested in data science, but need to learn more about it; and finally,
- Enhancing the workforce, training opportunities for the NIH data science workforce and the workforce in the community.
Those are the five main things.
There's another thing that my office partners in, but doesn't lead, and that's in data science policy. Developing robust data science policies for sharing and managing data is the purview of the NIH’s Office of Science Policy. As you know, we have a draft Data Management and Sharing Plan that has gone out for comment. We are now finalizing the draft and it should be out in the fall of 2020.
It's an exciting plan. Underneath all those high, lofty goals are a number of working groups that implement ideas to support those goals, and our job is to coordinate, to facilitate, to catalyze, to make data science happen through those working groups, and through our ICs.
What are the major problems that you came across in your strategic planning that your office now has to think about addressing?
We have a lot of really excellent data science capabilities at NIH, but they're developed for a particular IC with a particular mission, yet those capabilities could really enhance other ICs. Connecting our capabilities together in meaningful ways, and making our data platforms interoperable, is a challenge. You could log on to AnVIL, for example, and be able to pull in data from Kids First, if that's of interest to you, and do analytics there, or vice versa. Our analytics platforms should produce workflows across them, if that's of interest to a researcher. Can we enable a larger amount of data capabilities and analytics platform capabilities for researchers? That would be a tremendous asset to any researcher. That's not entirely possible right now, but there's a lot of thought in the community about how to actually make that happen. We leverage the work of our leaders, of GA4GH, of others, to think about building that connected infrastructure that's larger than any one of its parts.
That's amazing. You're working with 27 different institutes. Is that like bringing together 27 different companies?
Yeah. There's been some stories throughout history of solving this problem. Now, here we are, in data science, tackling the problem that has seen this happen before. Data harmonization is probably always best served by working with the people who generate the data and know the data really well. I think that a technology person trying to harmonize data would likely be unsuccessful, because there's a lot of nuances in data and in language that describes data, that I may completely miss.
So we’ve tackled this by partnering with folks who are really engaged and do the harmonization. For example, HL7 is thinking about how to harmonize clinical research data and electronic healthcare record data, and what we can do to map different standards onto each other. Since that's happening to some extent in the health IT industry, are there lessons learned that we can extrapolate to other types of research endeavors?
I work with the Common Fund, and one of their initiatives is to think about data harmonization across the efforts that they fund. That is an excellent use of time and energy: to get people that work together, to harmonize the data that they know really well.
I think harmonization is the grand challenge, actually. I think that findable and accessible is probably not so hard, and we can address that. Reusable is probably also not so hard, but interoperable and harmonization are incredibly hard. If there are ways that we can think about developing semantic ontologies for data and metadata, and I know this is happening, then we can work to bring those resources together in a harmonized way.
Do grants have a role in incentivizing this? The power of the purse actually helps in that way, to incent behavior.
Exactly. We are incentivizing that cultural change from policy, and from financial rewards. Having both, I think, helps. I think that researchers have to know that it's a value, that it's not a waste of their time and energy. Persuading people that that's true is challenging. They have to see that other researchers are going to use this data, otherwise, it is a waste of time and effort for them.
Is there a difference between data harmonization, and data hygiene?
I think of data hygiene as making sure that metadata is complete, and that the data fields are all filled out, so that the sets are complete. But harmonization is more than just completeness. It's about making sure that if you have an ontology, it can map to other ontologies. That you've put in ways to make your metadata computable, you've redacted schema, so I think it's the next step beyond cleaning data. It's making it actually, really robust.
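Editor's illustration: the distinction drawn above between hygiene (is the metadata complete?) and harmonization (can one vocabulary be mapped onto another?) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not anything NIH uses: the field names, the two lab vocabularies, and the `TISSUE:` identifiers are all hypothetical and were made up for this example.

```python
# Hypothetical field names and vocabularies, invented for illustration only.
REQUIRED_FIELDS = ("sample_id", "tissue", "assay")

# Each source uses its own local terms; the shared IDs are made up.
TISSUE_MAP = {
    "lab_a": {"liver": "TISSUE:0001", "kidney": "TISSUE:0002"},
    "lab_b": {"LIV": "TISSUE:0001", "KID": "TISSUE:0002"},
}


def missing_fields(record):
    """Hygiene check: list the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def harmonize_tissue(source, record):
    """Harmonization step: return a copy of the record whose tissue term
    is replaced by the shared ID. Raises KeyError if no mapping exists."""
    harmonized = dict(record)
    harmonized["tissue"] = TISSUE_MAP[source][record["tissue"]]
    return harmonized


record_a = {"sample_id": "A1", "tissue": "liver", "assay": "RNA-seq"}
record_b = {"sample_id": "B7", "tissue": "LIV", "assay": "RNA-seq"}
incomplete = {"sample_id": "C3", "tissue": "", "assay": "RNA-seq"}

print(missing_fields(incomplete))  # the empty tissue field is reported
print(missing_fields(record_a), missing_fields(record_b))  # both are complete
print(harmonize_tissue("lab_a", record_a)["tissue"]
      == harmonize_tissue("lab_b", record_b)["tissue"])  # comparable after mapping
```

Both `record_a` and `record_b` pass the completeness check, yet their tissue values cannot be compared until each is mapped onto the shared vocabulary, which is the sense in which harmonization goes beyond cleaning.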
Back in January you gave a talk titled, “The NIH Strategic Vision for Data Science: Enabling a FAIR Data Ecosystem.” You have a couple of concepts in the title: enabling, FAIR, and data ecosystems. Tell me how those things interplay and work together.
My vision is that data science and data use are different for different researchers and they have different needs. There is no one size fits all. There's no magic platform that will solve all the problems that we need to solve with data, and we need to meet people where they are. In that sense, having solutions that address different research needs for different types of communities is probably the first step in getting people to actually use data robustly in their research. Having an ecosystem of solutions was born from that vision.
That means that, if you want to use data from the PDB [Protein Data Bank], that's the appropriate way that you use data, but if there's no data repository, then we have to think about, how can we actually instantiate a place for people to use data and collect information, and work as a team? That could be the Cloud, through STRIDES, or other. It could be general repositories; it could be the platforms that we're thinking about. The point is that in order to really instantiate data sharing and FAIR data use, we have to think about all the different ways in which researchers would want to use data.
We can't just say, "Oh, let's just make data FAIR." Mostly researchers will say to me, "What does that mean for me?" They want to see themselves in the data solution, so that's my way of addressing their needs and their use cases, to say that, "Yes, I get it. I hear you. Your way of working with data is going to be like this, and here's the way that we can think about it, moving forward."
You've gone to a use-case model. Can you tell me a little bit about where that's coming from, how you use it, and what impact it's having?
Yeah. Absolutely. That's sort of a trick that I learned when I was working at the Department of Energy. When we were ready to start a big initiative or a big project, the first thing that we did was to understand the basic needs of the people who were going to work on or use the results of that project. We would do a number of workshops. We would ask people, "What are you doing right now? What do you want to do in five years? What are the challenges, and how can we (this project) address them?" It's really listening to their scientific stories, as they tell them around computing or data. That's how I really fell in love with the use-case model.
I think it’s sort of borne out in industry as well. You listen to your users and you start to develop solutions that address some of their concerns, but in a more technical way.
When I talked with Eli, we talked about the future of what he was calling Science DMZs, but he also said that we need to make sure that we have a conversation about the infrastructure underneath, because the networks, and how data moves, are the foundation of these collaborative platforms. I'm curious to get your impression about the importance of infrastructure. How critical is that?
Oh, I think it's immensely important. It's the linchpin of what data science stands on: robust infrastructures that are adaptive to how science is moving forward.
When I was in grad school, I worked on this tiny little Mac, and I had to FTP everything. There was no web (for researchers). I saw the birth of the web for my community. That change in technology is always happening, and we have to move science with it. That means that our infrastructure has to be nimble and adaptable, and our science just has to be nimble and adaptable as well. That is hard when you're moving a very large organization forward, but we still have to do it. Infrastructure is incredibly important, as well as agility in infrastructure. That doesn't mean that you will always want to be quickly moving the ship, but we have to move the ship forward. It can't stay in the port.
That's really the challenge: knowing how to move things forward. Watching the bleeding edge and knowing when that bleeding edge is appropriate for the bigger organization to start working with or working on. It's folks like the DOE that I'd pay a lot of attention to. They are much further along in adopting infrastructures for science than NIH, because that's their mission. DOE has the muscle because of the sheer size of the national labs and the work that they do.
This spring your world was turned upside down. You had worked diligently getting the strategic plan in all your meetings, in all your use cases, and all the things we just talked about, and then COVID-19 happened.
Yeah, absolutely. Before COVID-19, we had the luxury to think about, "Oh, how will we make data interoperable, and how will we instantiate FHIR for NIH research communities, and how will we make these platforms interoperable?" It was a very luxurious time to work with our communities, in a timescale that seemed reasonable, like a normal eight-hour day.
Now, I sure wish we had made more progress, because now we have to do a lot, very, very fast. If we really want to solve COVID-19, in a serious and meaningful way, we have to get the researchers a lot of data, very fast. But it can't just be a data dump; that won't help them. They have to have data that's interoperable and harmonized. They have to be able to use it, for example, for AI, and that means that it has to be really robust, with good training sets.
They want all kinds of data: images, electronic healthcare record data, death data, claims data, data about molecules that we're testing for drugs. It's a lot of information that we need to get to them, and we aren't ready for that level of data science engagement from the research community, but we really need to be.
Now, we work long, long hours, discussing and strategizing for what will hopefully be a pretty big data science initiative on COVID-19 that we're slowly starting to act on. We are a little bit slower than others, because we like to plan, but I'm very optimistic that our plans will actually come to fruition. We have the funding from Congress to do it, so we're going to roll up our sleeves and do it.
What do you see as your biggest challenges that you're now facing in this new reality?
I think the biggest challenge for us is that we're doing all of this in NIH with the immense goodwill, hard work, and participation of many, many people, all sitting in our homes. The amount of coordination and just basic meetings to get this done is much more than I think it would be if we could actually do it in person, but that's not possible, so we're moving forward at a rapid pace.
Okay. Let me ask you, out of all this COVID stuff, what have you learned, or what are you learning, that you didn't know before about your job, data, whatever?
I think the thing that I've learned is just to have a greater appreciation for the amount of different types of research and researchers that want to use data, and the vast amount of data that we have to work with. I had thought I had a pretty good handle on what we were doing at NIH, but this has just taken it to another level. I'm trying to understand, how do you work across agencies that have different goals and different ways of organizing data? How can we actually make this effective? How can we instantiate a very large data culture, fairly rapidly?
I have a much greater appreciation for how hard this is going to be, than I might have originally thought, because we don't have the luxury of time and/or significant amounts of time for planning. We have to make this happen.
Do you think we're going back to normal? Or what do you think the new normal is going to be? Because I don't know, I think the genie just came out of the lantern, and I don't think there's a possibility to stuff that thing back.
I think there will be a new normal, and a new way that we work. The fun thing, and the hard thing is that evolution happens by dramatic events. That's the perturbation that changes the genome, at least that's what I understand. I guess one theory is that, you see dramatic events in, for example, the climate, and then all of a sudden, the population kind of adapts.
We have a dramatic event in our working culture that is going to change our life, and the way that we work, and the way that we think about work. One of the things that I think will happen is that we will be much more adaptable at working, not necessarily in old-fashioned offices and old-fashioned ways of working. We're going to be much more adept at working, basically, virtually anywhere, and much more adept at working in large teams.
But I also think we're going to be much more adept at sharing data, using data, and using data practically for our lives. We are going to be doing much more testing, and much more work with sharing testing data. I think that is going to interconnect and infiltrate how we use data in the future, and hopefully for good, making a positive impact on data sharing and data infrastructure. I bet that mobile devices will change, and sensor devices will change, because they'll have to.
I think that people find value in sharing and using data. I think that we won't want to go back, that we will actually see that this is a value to science and it's pushing science forward, and it's pushing our lives forward in new ways. I think that we, researchers, and we, community and citizens, will want to continue to take advantage of the amount of infrastructure and the amount of work that we've done to make data available and shared.