Matthew Trunnell On Data, Silos, COVID-19, And Advice To Himself 15 Years Ago

March 19, 2020 | TRENDS FROM THE TRENCHES—Stan Gloss, founding partner at the BioTeam, and Matthew Trunnell, now data commoner-at-large, formerly of the Fred Hutchinson Cancer Center and the Broad Institute, have a long history of collaboration. The two have worked together for more than 20 years.

They recently sat down together to discuss how their views of data have changed over the past two decades, what technologies provided tipping points, where the next leaps will happen, and what we may have done differently with the benefit of hindsight. And, of course, a little bit about coronavirus. Bio-IT World was invited to listen in.

Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology. The transcript below has been edited for length and clarity. For more from this discussion, including the impact of Moore’s Law on sequencing and the future of interdisciplinary research, see the full version at BioTeam.net.

Stan Gloss: Let’s go back 20 years to when we first met each other. Data didn't seem like an issue back then. Why not?

Matthew Trunnell: Twenty years ago, the volume of data was not tremendous. GenBank was still being distributed on CDs. The challenge was the kind of computation that people wanted to do. BLAST was the big consumer of CPUs, used to do all-versus-all kinds of comparison searches. Scientists were outgrowing their desktop workstations, and that was really the opportunity at the time.

We were still looking at results more than data. When I got involved in this pre-Blackstone (pre-2000), working at Genome Therapeutics Corporation, there was a data product, a pathogen database. Celera and others were getting into the data product space. But the idea of generalized data reuse was just not there really. We had reference data and we had primary data, but we weren't actually making secondary use of primary data.

Stan Gloss: The NCBI provided a centralized resource for genomic data. Was some data being reused?

Matthew Trunnell: I think genomic data was the leading edge of data reuse because genomic data—even if it wasn't collected for the purpose that you are interested in—could provide a reference that would increase the statistical power of your analysis. And in many ways the introduction of short-read genomic data was the beginning of data science and data engineering in our space.

Stan Gloss: Why do you say that?

Matthew Trunnell: That was when we started seeing the Bayesian modelers come in, along with the tools that were developed for analyzing short-read data, like GATK. GATK is a statistical tool package, very different from what we were using to analyze capillary data. So you have the biostatisticians doing conventional statistics, while GATK and most of the modeling that continues to be done to analyze genomic data, whether it's GWAS or the primary upstream analysis associated with alignment, is probabilistic modeling.

So when we started seeing a lot of short-read sequence, we suddenly needed not just software engineers in the lab to make the data coming off the instrument usable, but we actually needed data engineers and data scientists in the lab. We hadn't needed that before. I would argue that data engineering was just not very big for most organizations. The Broad was out in front of it and kind of hit it in 2007, but it was several years after that before other organizations started to feel that same organizational gap: the gap between the IT side that knew how to store and manage data and the associated compute, and the research side that knew about the data.

That was a fine separation for a long time. IT didn't need to understand the contents of the data in order to run the investigators' tools. But we came to a point where investigators began to outgrow their familiar tools (because of data size/complexity), and it wasn't going to be IT that solved that, at least not the IT organizations of yesterday.

And that's when data, for me, started becoming a thing. There were various aspects of data that became more important. The first is a very practical thing, the storage of genomic data. In this case we were spending $6M/year on new storage hardware to store data in a form we knew had not been optimized for storage. This was not a problem IT could solve by itself because it required a research-level understanding of the data. Yet it was not only a research problem either.

Stan Gloss: When high-throughput short-read DNA sequencers started coming online, was that a defining moment in data production?

Matthew Trunnell: Yes. On the life sciences side and the research side two things happened: one is that it removed a bottleneck in data generation. One of the rate-limiting factors in genomic analysis certainly was data generation. People would spend their whole careers isolating a single gene. Suddenly we had this ability to generate data with tremendous velocity. But the second thing is that the data being spewed out of these machines weren't usable by the average researcher. The whole introduction of data engineering and data science into this space was how to distill that data coming off the sequencers into something that could be usable.

Stan Gloss: It seems like what you've described is kind of still where we are today. There's a lot of data out there with many different data formats from many different types of instruments that are still pumping out tons and tons of data. Now more and more people want to ask better questions of more data, but they find themselves stuck in some ways.

Matthew Trunnell: Yes, it's not just the sequencing. We're seeing now advances in microscopy and cryo-EM. It's the advances in the laboratory technology that are driving the volumes of data that are overwhelming the existing infrastructure for data management and analysis. Microscopy has just been continuing to drive technology. The thing that advanced even faster than Moore's law for CPUs was the performance and reduced cost of CCDs, digital cameras. And so that CCD technology has also had an impact in this space. Flow cytometry is another great example. And then in the last two years there has been this rise of single cell, which of course is all of this tied together.

So, where we find ourselves now is A, we have a ton of data and B, we're generating large-scale multimodal data. And so now we find ourselves back in this place of, okay, I have high-resolution imaging data and I have RNA-seq data. How do I analyze these together?

Stan Gloss: One of the things that I've noticed in traveling around and talking to people is they still have a problem with their data being locked up in silos. The way in which organizations are structured by dividing people up by specialties has created a culture of silo building. You've been part of the Broad and the Hutch and all these places. What do you think about the culture of scientists? Do we need to start thinking also about people engineering, not just necessarily technological engineering or data engineering? Where do people come into this?

Matthew Trunnell: The technology systems we've seen are a reflection of the organizations. That's the way the technology evolves. There are two things that drive this in my mind in this space. One is that biology itself has been traditionally a very reductionistic practice, right? You've got your cell biologists and your molecular biologists, and you look at the NIH and you divide everything up by disease area. There's no single center or institute at NIH, for example, that deals with inflammation or remodeling. And yet we know that spans so many different diseases, but it doesn't fit into the NIH organizational structure. This is also how academia works. Academia is built around individual labs and all the incentives are built around individual labs.

So, we had a culture that still is to a large degree emphasizing silos. It's hard to lead that silo busting with technology because you still run into all the cultural and social issues around data sharing and the fundamental imbalance in the cost and benefit of data sharing: To make data shareable requires more effort on the data producer's side, but the benefit is realized by the data consumer and not the data producer. And that fundamental imbalance will continue to drive things toward silos without some investment of effort on the part of organizations.

Stan Gloss: Is the way in which scientists gain attribution for their research a barrier to sharing data?

Matthew Trunnell: Absolutely, that's because attribution is the currency of academia. So the entire academic system is built around attribution. It's an interesting time right now with COVID-19, where organizations will make a big deal about sharing data immediately. Like, oh, we're doing this for the public good, and I don't understand why we can't push people to the realization that that's exactly the same thing with cancer. Cancer is a bigger epidemic than COVID-19. It's just not spreading the same way. The value of putting data out into the public is exactly the same and I find it frustrating that we'll pat ourselves on the back for sharing some data while we continue to go on and hoard other data.

Stan Gloss: Do you think we'll ever get to a time where datasets, clean datasets, kind of look like movies that we get from Netflix and the ability to have a system of streaming high-quality data?

Matthew Trunnell: The streaming is an interesting point. I believe that it is true that in some areas we will benefit from thinking not about data as sets but as time-dependent streams. And there's a lot of reason for that. I mean, our ability to produce data is going up so fast, yet we keep looking at things as static datasets, which is the NIH view. That's thinking in a backwards way rather than thinking in a forward way. We're going to have data spewing out of these various sources, and we do now. I think the thing that's going to keep slowing us down in biomedicine is our relative inability to deal effectively with clinical data.

Stan Gloss: Why is clinical data a problem?

Matthew Trunnell: There's traditionally been a pretty big gap between the research side and the clinical side. There's great interest in bringing those areas closer together and building a more “learning health system.” And this is why the position of Chief Research Information Officer was created to try to bridge those two domains. The challenge is that as a data producer, the health system sees very little value in participating in a broader ecosystem. That's a general statement, but hospital leadership is focused on business metrics on a five-year horizon, and there hasn't been sufficient demonstration of a learning health system.

Some are starting to see opportunities with Google and others to look at their data as an alternative source of revenue. And that's of direct interest to the hospital leadership. Hospitals are crazy businesses with razor-thin margins, so if they can find novel sources of revenue, that's a win. When a group like Google says, “Give us your data, we'll do other work on it. You don't have to do any extra work. Just give it to us and we'll give you some money,” that's appealing to some. I believe that until the hospital systems and healthcare enterprises get better about data, we are going to be hampered in our ability to drive all of the great advances on the research side into the clinical side.

Stan Gloss: I’d contend that the coronavirus infection rate is a prime demonstration of the gap between research and clinical.

Matthew Trunnell: I would argue that too. Trevor Bedford, who is at Fred Hutch and has been leading COVID-19 analysis efforts since the first deaths here, has become a statistical spokesperson for coronavirus. He was already relatively geared up because this is the second year of the Seattle Flu Study, an effort to study the propagation and mutation of flu virus. All of the analysis pipelines, sequencing, and sample collection are already in place, which is great.

But we still don't have any of the clinical data. How can we expand beyond our local health system to be pulling in the data we need from all of the other Seattle healthcare systems? And that's a big problem.

Healthcare data is the prime example of data that was not collected for secondary use, right? You collect data in a clinical setting for the sole purpose of treating that patient and that's reasonable. That's the mission of the health system. Apart from efforts around quality improvement, there's very little systematic reuse of those data and I think that's a huge opportunity. That's one of the areas that I'm pretty excited about.

Stan Gloss: So we have researchers and clinicians that work on one question or patient at a time rather than thinking holistically?

Matthew Trunnell: I think that's true and that's not an unreasonable research approach. The thing that we really haven't seen come to penetrate healthcare and life sciences is this so-called fourth paradigm of science, which is data-driven hypothesis generation. And I think it will have big impact, but you have to have the data in order to do it, and it's a really interesting challenge to try to talk to an ethics review board about the value of data-driven discovery when our whole system of human subjects research is focused on protecting individuals. Data-driven discovery is a really hard sell.

Stan Gloss: If you could look back 15 years and advise yourself with the knowledge that you have now, what advice would you give yourself?

Matthew Trunnell: Good question. I think 15 years ago I wasn't thinking data. I was thinking storage. I was thinking of data as nothing more than a collection of objects and I didn't care what the objects were, and if I had been on this data bandwagon 15 years ago, I think I could have had more impact. Certainly we would have gotten the Broad to a place closer to where it is now much sooner.

Stan Gloss: Right, I know. It's kind of like back then people thought of data as grains of sand, and it's almost like we have to change the perception so that a piece of data is seen as a seed and not sand. It's something that with nurturing could actually grow into something.

Matthew Trunnell: Yeah, I like that.