The Role of Data Management in Advancing Biology

September 22, 2020 | Kjiersten Fagnan had a plan: she would be an engineer, work for a company, and write software all day. But in graduate school she was exposed to interesting math problems in biology: modeling acoustic waves as they propagate through bone and soft tissue. She added some classes on gross anatomy and biomechanics to her schedule.

As a post-doc at Lawrence Berkeley National Laboratory, she was working on modeling how groundwater flows through soil and bedrock, but when a position in bioinformatics consulting for the Joint Genome Institute opened at NERSC, the National Energy Research Scientific Computing Center, she applied. She was able to use some of what she had learned about computational biology, and the work sparked an interest in genomics. She was also able to help staff at the JGI assess their data management: how they were using file systems and how they optimized some of their scientific workflows. She was gaining a growing appreciation for how dynamic the field is.

Four and a half years ago, Fagnan took on the role of the Chief Informatics Officer at JGI, and, after an internal reorganization, she now leads a department dedicated to data science and informatics. “Our focus right now is on breaking down barriers to the access and use of data that JGI produces,” she says.

Stan Gloss, founding partner at the BioTeam, recently spoke with Fagnan about the community challenges around creating and sharing more useful data, and how “human engineering” can help advance biology. Bio-IT World was invited to listen in.

Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology. The transcript below has been edited for length and clarity. Fagnan will be speaking at the upcoming Bio-IT World Conference & Expo Virtual. She’ll be participating on the Trends from the Trenches plenary panel as well as speaking in the Data and Metadata Management program.

Stan Gloss: Where was data when you were breaking into the field? What is your perception of the history of how we got here in terms of data?

Kjiersten Fagnan: When I started working with the Joint Genome Institute, they were in the midst of the exponential growth curve caused by next-generation sequencing; JGI was able to generate much more data than they had previously.

A lot of the old methods that were being used to track and manage data and make it available to our external user community were breaking. We moved from a place where you could keep all the data on a hard drive or network attached storage on a single server to needing large parallel file systems, and shifted from having thousands, maybe tens of thousands of files to having millions and then billions of files. The old method of naming your file path something that was meaningful to a human so that you could find data was insufficient for anyone actually tracking down data that were generated a year ago or possibly even a month ago.

That was the state of things when JGI moved its computing over to NERSC. We realized that one of the first problems that needed to be tackled was data management. Part of our mission at JGI is to make our data available not just by doing deposition into the NCBI SRA repository, but also to have our own access plans for analysis and use of the data.

You just can't expect humans to keep track of where all the data are, so we created a data management system that lets the JGI staff who are producing data with their pipelines tag all of the datasets that are getting generated with appropriate metadata. That metadata lives in a document store in a NoSQL database, where it can then be queried by the next group that is going to be using the data. This became the newer backend for the JGI Genome Portal, which is how most of our users are now accessing and downloading the data that are available to the public.
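To make the tag-then-query idea concrete, here is a minimal sketch in Python. It is only an illustration, not JAMO's actual design or API: the `MetadataStore` class, the field names (`stage`, `organism`, `run_id`), and the file paths are all hypothetical, and an in-memory list stands in for the NoSQL document store.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MetadataStore:
    """Toy in-memory stand-in for a document store (hypothetical, not JAMO's API)."""

    records: list[dict[str, Any]] = field(default_factory=list)

    def register(self, path: str, **metadata: Any) -> dict[str, Any]:
        # Store the file location alongside whatever metadata the pipeline supplies.
        record = {"path": path, **metadata}
        self.records.append(record)
        return record

    def find(self, **criteria: Any) -> list[dict[str, Any]]:
        # Return every record whose fields match all of the given criteria.
        return [
            r for r in self.records
            if all(r.get(key) == value for key, value in criteria.items())
        ]


# A producing pipeline tags each dataset as it is generated (made-up values).
store = MetadataStore()
store.register("/fs1/run42/reads.fastq.gz", stage="sequencing", organism="E. coli", run_id=42)
store.register("/fs2/run42/filtered.fastq.gz", stage="qc", organism="E. coli", run_id=42)
store.register("/fs2/run43/filtered.fastq.gz", stage="qc", organism="B. subtilis", run_id=43)

# The next group asks for data by what it is, not by where it lives.
qc_inputs = store.find(stage="qc", organism="E. coli")
print([r["path"] for r in qc_inputs])
```

The point of the sketch is that the consumer never needs to know the file path convention; the path is just another field returned by the query.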

The data management project represented a lot of work. We had to rethink how the JGI staff were using the file systems. We had to rethink how to structure the storage infrastructure to meet the different needs of the workload. There is a natural transfer of data from sequence data generation to quality control and assurance (QA/QC) to genome assembly to genome annotation; the computational workload of each of these activities has different characteristics.

The sequencers require an isolated file system to ensure data integrity and minimize disruptions from downstream analysis. However, the subsequent groups who would access the data needed that data to be available on high performance computing resources. We had to deal with data transfers and management of data across different file systems. Once the data management group is done with initial processing, there's an automated system that lets the QA/QC group pick up those data as inputs for processing on a separate file system. Once the QA/QC groups are done, their output gets picked up by the assembly group as input and gets moved to separate file systems for subsequent analysis.

We created this linking between all of the different groups that need to touch and process our production data that comes off the sequencers. And then we created the system for managing and tracking all of those links. That enabled the systems to be automated so that JGI could scale up operations when it came to that fundamental data processing.
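The linking between groups described above can be pictured as each dataset recording which dataset it was derived from. The following Python sketch assumes that simple one-parent model; the dataset identifiers and the `lineage` helper are hypothetical and are not taken from JGI's system.

```python
# Hypothetical "derived from" links: each output points at the input it came from.
derived_from = {
    "annotation-001": "assembly-001",
    "assembly-001": "qc-001",
    "qc-001": "raw-001",
}


def lineage(dataset_id: str, parent_of: dict[str, str]) -> list[str]:
    """Walk the links from a dataset back to the raw sequencer output."""
    chain = [dataset_id]
    seen = {dataset_id}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            # Guard against a corrupted link table that loops back on itself.
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(lineage("annotation-001", derived_from))
```

With links like these recorded automatically at every handoff, a later stage can be triggered whenever a new upstream dataset appears, and any result can be traced back to the run that produced it.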

That's just what's needed to make these data useful to our collaborators in the scientific community and to the scientists that we work with. The deployment of the data management system required this overhaul of the underlying storage infrastructure.

And we haven't even really talked about the downstream analysis when you want to pull thousands of data sets into a single analysis that might come from a variety of different products. Our new data management system also helps our staff and collaborators query across all of JGI’s resources so that they can reproducibly access the data and analyze it when it comes from different data sources.

It's fascinating. So how long have you been actually doing this work?

Almost seven years that it's been in operation at JGI. The data management system was deployed in 2013, about a year after I started at the JGI. We were shutting down an old file system and moving to a new file system, so one way we encouraged the right use of the new file system was to say, "Hey if you move over to this data management system that we built for the JGI we will handle the movement of all of your data so it'll get onto the new file system. It'll be secure."

And so we got a core group of the organization actually using it that way. And then over time we’ve worked with any outliers to get them set up and using the system as well.

Wow, that's amazing. So is this that project called JAMO?

It is—the JGI Archive and Metadata Organizer.

Is that open source?

We haven't completed the open sourcing of the software and we really should. As part of the National Microbiome Data Collaborative, we want to repurpose the JAMO software. When the software was written it was pretty tightly coupled to the NERSC infrastructure. Since then we’ve been working on removing those dependencies and making JAMO distributed. That is happening as part of the National Microbiome Data Collaborative effort.

It sounds like your mission right from the beginning was to make the data FAIR: findable, accessible, interoperable and reusable.

That's a reasonable characterization. I think FAIR has really codified a way of thinking about data that's incredibly aspirational and resonates with people. One of the biggest challenges we're facing in this field right now is findability of the data—search is a hard problem. Then let's say you manage to find some data that you're very interested in; a lot of the time it's not clear whether or not those data are accessible to you or to the public.

There's been a large push over the last decade to make everything reproducible, to make the data accessible, to have a data management plan.

A lot of that effort isn't necessarily resourced, so just because you have a data management plan doesn't mean that you have a clear place where you can actually put data. We're lucky that the Sequence Read Archive exists and that the NIH continues to fund it, because that's become one of these major focal points for collecting the data.

But even more than that, when you're in the middle of collecting data for a very specific question, you're not necessarily thinking about what other information to collect to make these data useful to other groups or other labs. That's not a part of the thought experiment that you're going through in that moment. And then it becomes kind of hard to go back and figure out what you were doing or figure out what some of those missing pieces of metadata might be that can provide enough context to another researcher.

I think that's really the struggle right now: this idea of at least moving toward more findable data relies on getting people and encouraging folks in the community to think that way. What would it take for somebody else to use your data? There's a lot of work happening in the development and assignment of ontologies. There's a lot of work happening in trying to describe minimal sets of metadata, but then you have to apply that throughout the processing pipeline and biological data are incredibly complex. It's challenging for sure.

As a user facility, it's really important that we at JGI do our best to make the data that we generate, which have been funded by federal taxpayer dollars, as available, accessible, and findable to the community as possible to promote their reuse. I am proud of some of the work JGI is doing to make our data more FAIR. We have spent most of this year working to refine the graphical user interfaces to our data (https://data.jgi.doe.gov), and we are about to pivot to a revamped API. JGI has worked really closely with our user community to ensure these new resources work well, and I think community support will be a key driver in making our data accessibility efforts a success.

I think we're at this moment with technology where all of these things feel like they should be more achievable—with containerization, with ontologies, with the ability to assign metadata dynamically. But there are still social engineering threads that need to be pulled together so that the community and the scientists understand why they're being asked to do some of the boring work of curating their metadata.

Yes! In my interview with Matthew Trunnell he said, "Eighty percent of the time that a data scientist works, is spent organizing and cleaning up data. And the other 20% is complaining about organizing and cleaning up data." How true of a statement is that to you?

Incredibly true! A current example is trying to wrangle the needed metadata for one of our National Microbiome Data Collaborative projects. Finding this contextual information has required the effort of a handful of staff to go back through laboratory notebooks, OneNote documents, Google Sheets and things that might be in a Google Drive, to find links between mass spec data or sequencing data and the common samples that were provided by the PI.

The bulk of the time isn’t being spent looking at anything new in the data. It's just trying to sanity check and make sure that the data you have is what you expect. You may find something in the NCBI repository, and you hope it has the correct metadata assigned to it, but you always need to do a QC step to make sure that what's actually reported to be there is what's in the dataset itself, which relies on having excellent reference databases to query against. And you need the ability to at least follow up with the PI or the organization that did the submission in case you find that something is missing or incorrect.
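As a small illustration of that kind of QC step, the Python sketch below compares one reported metadata field against what is actually in a file. It is deliberately simpler than checking against a reference database: it only verifies a read count, it assumes plain four-line FASTQ records, and the `read_count` field name and helper functions are hypothetical.

```python
import io
from typing import TextIO


def count_fastq_records(handle: TextIO) -> int:
    # Assumes simple four-line FASTQ records (no wrapped sequence lines).
    lines = sum(1 for line in handle if line.strip())
    if lines % 4 != 0:
        raise ValueError(f"expected a multiple of 4 lines, found {lines}")
    return lines // 4


def check_metadata(reported: dict, handle: TextIO) -> list[str]:
    """Return a list of mismatches between reported metadata and the data itself."""
    problems = []
    actual = count_fastq_records(handle)
    if reported.get("read_count") != actual:
        problems.append(
            f"read_count: reported {reported.get('read_count')}, found {actual}"
        )
    return problems


# Two made-up reads, but the submitted metadata claims three.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n"
problems = check_metadata({"read_count": 3}, io.StringIO(fastq_text))
print(problems)
```

An empty list means the reported value matched; anything else is a prompt to follow up with whoever made the submission.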

It's just a very high touch process right now. It involves a lot of digging around for something you didn't necessarily generate yourself. Or if you did generate it yourself, it was one or two years ago; you've moved on from that project and it's hard to stay organized.

When you want to analyze the data and the data hygiene hasn't been kept up to date, it's just incredibly time consuming to make sure you've got what you're interested in, and that it’s of useful quality before you hit start.

Yeah. This sounds very familiar. When we look at these kinds of problems, what percentage of the challenge is a human engineering problem versus a technology problem? It seems like we can develop databases and tagging systems and all of that, but the complexity comes in when we have to deal with human beings.

I've thought about this a little bit from the perspective of the advantages that industry has over scientific domains right now. When you think about some of what you need for different data science applications, you need a lot of well-labeled data. Industry has been able to use those little reCAPTCHAs to get us to identify and label data on their behalf, which is pretty clever. But with scientific data, it's a little bit harder because I can't just give a FASTA file or a FASTQ file to anyone and have them identify a gene out of that data. I can't give a mass spec output to a relative to have them do any sort of labeling that's scientifically meaningful.

We need highly trained scientists with PhDs who have a lot of experience looking at the data to actually go through and help us figure out all of the data. But that's not necessarily how you want these highly trained PhDs spending their time.

So we come up with principles like FAIR, we come up with ontologies, but if you talk to anybody who's looked at ontologies or developed schemas for these data it becomes complicated really quickly.

The technology's only as well-informed as the humans that will spend the time to work with it. So even if we wanted to automate metadata assignment for different data files, you need to have somebody that sits down and identifies yes, this makes sense. Or no it doesn't. And we're not at a point where we can necessarily have machines do that for us.

If we could convince a core group of scientists that, if they spent the time on it, the machines could help us, and that this would accelerate discovery and accelerate science, I think scientists might buy into the process just a little bit more.

I would love to see a sponsored event where we ask people to use other people's data for analysis and research and we somehow have a feedback loop to the people who do a good job labeling their data or providing context and metadata so somebody else can actually use it. We need to incentivize everybody sitting down and at least trying to do this once. I'd love for even scientists who've been in the JGI to try using one another's data without being able to phone a friend and get help.

They might come to understand how hard it is, but then it might also reinforce behavior like, "Well I'll just generate more data myself because it's cheap."

There seems to be a problem between the people who produce the data and the people who consume the data. And there seems to be some kind of disconnect in the incentive system between those two.

If we think about the scientific productivity at the Joint Genome Institute, for the past decade or more it's been about our production of sequence data as well as our production of user-generated scientific publications. When we challenge ourselves to think harder about the impact of JGI data on scientific productivity, it's really a lot harder to track that down and to figure out who's been using your data, what they've been using it for, and if it's been published. And so that becomes this interesting thought experiment. If you can't even figure out how the data are being used in the wild it becomes impossible to quantify that impact.

There is also a behavior pattern where a scientist contributing data into these repositories may have provided incomplete metadata or obfuscated something because they have been mandated to deposit their data, but they don't consider that they're done with it yet. So they can check that box for their data management plan, but if somebody else goes to use it, they might not be able to.

If you're a scientist in a small lab, you don't necessarily want somebody else to publish on that before you get a chance to. You've done all this work to collect and generate a data set. There is also a disconnect between what's getting funded and the behavior we want to see.

I think behavior change needs to start with funding agencies and journals working together to come up with a different set of metrics and challenging people to think beyond their publication to explaining the impact of the work that they did.

Something I would love to see that I heard Barend Mons from GO FAIR promote is the idea that the data should be what gets published in great detail with a lot of thought and a lot of care that goes into describing where they came from, how they were produced, all of that contextual information to make the data reusable. And then the scientific story about how you used the data is a secondary publication. I’ve been told this is an impossible ask from a number of scientists because they’re motivated to publish their scientific insights, not data reports. This leaves me with the question, which advances science more, the data and story from a single lab, or those data shared in the context of and linked to previous work? I’m not sure, but given the narrative around “big data”, it seems like we should be attempting to answer this question to empower the funders, journals, and institutions to drive the needed culture change around data as first-class citizens.

One concept that we've been kind of toying around with at BioTeam is really the thought of micro publication. What would the world look like if we broke the process down and gave attribution along the way? So that data producers who produce high quality data get a high rating as a data producer, and then other people who could pick up that data and get attribution for doing the analysis. So it's a way of sharing attribution down the pipeline in each different phase such that everybody who touches it collaboratively basically achieves attribution and they really stay focused on the things that they're really good at.

I like what you're suggesting about different roles where you get credit throughout that whole production pipeline. I might make the case that we don't necessarily have the infrastructure in place to enable that right now. I think it's more of a software infrastructure problem than anything else. How do you create the right kind of usable systems for folks so that we reduce some of the burdens, reduce some of the barriers to creating high quality data sets, and where high quality also means you provide all of the information about the data so there's clear connectivity back to the data producer. Can we make it easy to provide attribution and citation?

One of the things we've been doing at JGI is tracing links to data we’ve produced through publications. But some of the identifiers that JGI was using in the last 10 or 15 years are indexed and are difficult to find.

We need some sort of protocol in place that helps data producers create clearly identified datasets so that a data user or consumer can actually pick up those data and know where they came from. Users will also have more confidence in the data if they know the source. Even better would be the collocation of data and computing so the datasets never need to move, but could be “visited” and analyzed in place.

It feels like the technology exists to solve this, but I would say that the software infrastructure or the human usable systems aren't in place yet to facilitate that relationship.

How much do you think lack of organization of data is holding us back from taking advantage of machine learning and potentially artificial intelligence? Is that holding us back somehow?

Yes, lack of organization is holding us back. One of the things I like about the FAIR principles is that the end goal there is not to have necessarily an infrastructure for a human to be able to go through and systematically find all of the data they're looking for, but you're trying to make all of these data machine accessible and trying to set things up so that the data can be linked and eventually maybe that feeds into better and more powerful search of the data for humans as well.

There are a lot of challenges with having enough scientific data to be able to take advantage of machine learning or artificial intelligence, but it needs to be more of the right kind of data. In biology we can't necessarily answer questions about how microbes respond to drought tolerant environments if all of our microbial samples come from a marine or wetland environment.

We need to have enough information and enough data to be able to interrogate things and to be able to look for patterns. If you don't have some sort of well-organized structure around the data themselves, the pattern matching and finding becomes harder.

More than this, it feels that more sophisticated algorithms are needed to make it possible to query across the billions of omics objects and their environmental data. Over the next few years we are likely to see statistical learning create new capabilities for scientific inquiry; however, biologists, statisticians, and computational scientists must collaborate to make this a reality.

Roll the clock back 10 years. What advice would you give yourself knowing what you know today?

If it's specific to data, I would have challenged our crew to think early about the possibility of having even JGI data distributed across multiple facilities and centers with computing and storage collocated but distributed geographically—because that changes the way you think about framing and building a system.

I came into this work with this high-performance computing background where we really like to think about performance and scale and optimization, but I would sit myself down early and explain how dynamic biology is, how much things are changing, and how much our understanding has the potential to change on really different time scales.

Understanding that change gives you a different appreciation for what's needed, both from the hardware and software infrastructure perspective, but also in the user interfaces needed to support and empower these communities. I probably would have started advocating earlier for some of the software infrastructure that we're trying to build now to address the fact that everything is moving into this more distributed framing.

I would have told younger Kjiersten that, yes, it's going to be very hard to find partial differential equations, but there really isn't a single deterministic answer to a lot of the work that people are doing because you're studying and perturbing living systems, and so these answers really do look like distributions, not scalars.

I would have prodded myself to continue thinking harder about statistics, which is maybe not one of my favorite subjects, but it's of critical importance when you're trying to understand something with so much variability.