Data Science Foundation: Findable, Accessible Data

November 19, 2018 | At the 2018 Bio-IT World Conference & Expo, we dedicated the Wednesday morning plenary session to a panel discussion on data science. It was a rich conversation. Panelists laid out their own definitions of “data science”, discussed how to build and structure a data science team, and debated what role expertise (data science or domain) should play in experimental design. As we talked, we collected questions from the audience, more than we were able to use. Instead of abandoning the community’s contributions, I’ve collected the top-voted questions from the panel and offered them back to our community. Want to weigh in? Download the question list and join the conversation. Allison Proffitt, Editor

Adam Margolin

Professor and Chair of the Department of Genetics and Genomic Sciences; Senior Associate Dean of Precision Medicine; Director of the Icahn Institute for Data Science & Genomic Technology

Icahn School of Medicine at Mount Sinai

(See Bio-IT World’s interview with Margolin on his plans for the Icahn Institute for Data Science & Genomic Technology.)

Bio-IT World: What do you think is the optimal environment to grow data science?

Adam Margolin: We want to build a program for the way science will be done in the future. That is very different from the traditional model of science, which was built around somewhat isolated researchers and research groups working on their own individual problems, without really a goal of energizing efforts to tackle big problems that can only be solved by large cross-disciplinary teams. Science today and in the future, especially science that uses data science as a foundation, really looks different. It requires teams of people working together to glean insights from massive amounts of data that were organized from sources around the world and throughout our health systems. It also requires working hand in hand on a daily basis with experts in complementary areas, such as basic biology and technology development, with the goal of advancing big priorities that can only be achieved if teams of people really work together. The optimal environment for modern data science should have team-oriented, focused, mission-driven strategies at its core and support teams of data scientists and cross-disciplinary teams who work together to advance strategic goals in ways that can only be done through that kind of team science. It couldn't be done if they were working in isolation.

What are the different levels of maturity in data science in a lab and what does the roadmap look like?

In a mature data science organization, you continuously add value to a common resource of data and analytic tools. Contributions from data scientists across the team build on each other, such that each person's work enhances the work of the next person to achieve results faster and on a bigger scale. As people within that group work to organize data, they continuously improve the overall resource for many research uses, incorporating feedback from real usage of the resource. Similarly, as scientists develop analytical tools to glean insights and interpret data, those would be implemented as modular, reusable, extensible software components that could be applied, extended, or modified by other people on the team. A mature data science organization would have that kind of virtuous feedback loop, where it is continuously improving its methodology and data resources by integrating the contributions of each person within the team.

By contrast, a less mature data science group would reinvent the method to organize, process, and analyze data for each application that they're working on. Meaning, each researcher who wanted to analyze a dataset would be responsible for figuring out how to curate it, process it, store it, and develop the tools to analyze it. They would not benefit from the contributions of everyone else on the team.

How do you generalize the processes for working with heterogeneous and changing data types?

The key is making the core, foundational data elements broadly accessible and usable. Many of the ways to interpret the base data will rapidly evolve in a research context, and those studies can go on in an individual research lab without integrating back into the core research until the method has demonstrated maturity and validation. I don't think we need to generalize all of the rapidly evolving methods as we figure out how to work with heterogeneous data, but we do need to work from the same base foundation. The low-level data elements need to be widely accessible, usable, and ideally compliant with standards, such that people who are developing new ways to use this data are all starting from the same foundation. If we work in that way, then it is easy to incorporate new ways of processing and analyzing the data as they mature, because they can be added on top of a common foundation. But if everybody is reinventing from first principles, that is, first of all, inefficient and, second of all, makes it difficult to integrate the new approaches that will emerge from each group.

Related to Mount Sinai's new educational programs, what skill sets do you recommend that grad student researchers have, or people that are wanting to move into data science? Does domain expertise still matter?

As you mentioned, we're building out a new training area in the Data Science Ph.D. program and the new Data Science Master's program. As we’re thinking about those programs, the thing that I think is most important is to have a good foundation in the core areas of data analysis, including statistics, machine learning, and artificial intelligence: really those foundational domains of data science analysis that are applied across industries by any data scientist.

Domain expertise matters, for sure, though a lot of that can be learned in the course of doing research. I think it's important for training to give a solid foundation in the well-established, widely used areas of data analysis. If data scientists can be trained in domain-specific applications, that's great. If not, a lot of that can be worked through on their own, in the course of their research or during their educational program.

How do you define data curation?

Making messy data less messy. Mapping messy data into some kind of structure that can be interpreted by a computer, such that it's possible to use computational methods to analyze the dataset in its totality or multiple datasets in comparison with each other, which requires standardization of the data in a way that the computer can understand. The other element of data curation is just making data findable and accessible. That's actually more important than anything. If you talk to a really great data scientist, they'll probably say that if they can find the data, even if it's in a directory or on a hard drive or something, they can figure out how to use it. But they spend so much time not even being able to find the data. It lives in places that are not accessible to them. The first priority of data curation is just making the data findable and accessible. That's most important. Beyond that, make it standardized and computable, which is also important and, I think, something that we can solve if data is broadly accessible.
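As a rough illustration of what "mapping messy data into some kind of structure" can look like in practice, here is a minimal Python sketch. It is not from the interview: the field names, the alias table, and the sample records are all hypothetical, and a real curation effort would more likely map onto an established community standard than onto an ad hoc schema like this one.

```python
# Hypothetical example: the standard fields, aliases, and records below
# are invented for illustration only.
FIELD_ALIASES = {
    "patient_id": {"patient_id", "patientid", "pid", "subject"},
    "age_years": {"age", "age_years", "age (yrs)"},
    "diagnosis": {"diagnosis", "dx", "disease"},
}


def curate_record(raw):
    """Map one messy record (arbitrary key spellings) onto the standard fields."""
    curated = {field: None for field in FIELD_ALIASES}
    for key, value in raw.items():
        normalized_key = key.strip().lower()
        for field, aliases in FIELD_ALIASES.items():
            if normalized_key in aliases:
                # Trim stray whitespace from string values.
                curated[field] = value.strip() if isinstance(value, str) else value
    # Convert age to an integer when possible; otherwise record it as missing.
    try:
        curated["age_years"] = int(curated["age_years"])
    except (TypeError, ValueError):
        curated["age_years"] = None
    return curated


# Two sources describing the same kind of data with different conventions.
messy = [
    {"PID": " A17 ", "Age": "54", "Dx": "melanoma"},
    {"subject": "B02", "age (yrs)": "unknown", "disease": "Melanoma "},
]
curated = [curate_record(record) for record in messy]
```

After this step every record has the same keys, so the two sources can be analyzed together. Note that the sketch only addresses the "standardized and computable" half of curation; it does nothing for the findability and access problems described above.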

How big of a problem is messy data?

Messy data is a big problem, but inaccessible data is a much bigger problem at all scales, within institutions [and beyond]. We need to aggregate data across the global community. Messy data is a big problem, but great data scientists can solve that problem. Big tech companies like Google know how to work with messy data. But if they can't get the data in the first place, then there's not much innovation that can happen. So the most important thing for the field, in my opinion, is to be able to access the data that's coming out of global molecular profiling studies, now that DNA sequencing and other molecular profiling methods are becoming routine in clinical practice. That means that within years to decades, hundreds of millions of people will have billions of data points measured on their disease. If we could integrate the data coming out of all those studies, it would revolutionize our ability to understand the patients or populations who are likely to respond to therapies. That rests on whether or not we will be able to integrate and share that data across the global research community.

ROI and economics are not to be forgotten when talking about data science. How do you calculate the ROI of your data science efforts?

There are direct monetary returns on investment. There are scientific returns on investment. Then there's return on investment in terms of patient benefit, and, ideally, those things are correlated. As a scientist, I try to spend most of my time thinking about how we can be good scientific stewards of the money we're given in order to make the most impactful scientific advances and have the maximum impact on the patient. In reality, we need measures for that. As with most academic institutions, funding is the primary metric for return on investment, along with other sources of support for research from philanthropy and from industry partnership. My hope is that doing well on the metric means that we're actually doing well for the science and for the patient. The return on investment is kind of the scorecard of how effective we were at advancing science for patient benefit. Sometimes it works that way and sometimes it doesn't.

Are there areas which you feel are ripe for data science, but which have not embraced it yet?

Being in medicine, I think it's probably about time that our clinical practices abandoned the fax machine. But I think data science is becoming pretty pervasive at some level in all industries. There's certainly data science that's done well and data science that's not done well, with all kinds of levels of sophistication. I'm not sure that I know of an industry that ignores data science altogether. In biomedical research, what I think is most important is integrating data science much more closely with testing and translating all the discoveries that are coming from the data-enabled predictions. Meaning, using the predictions derived from large-scale data to inform experiments and to decide which therapies to pursue for translation to patients. So really using data science as the foundation of new discoveries that can be integrated into our discovery, clinical, and translational processes.

Have you had any experience with data science methods being misused because the implementer did not fully understand the methodology?

All the time. If you go to the literature, there are published studies that make mistakes in how they analyze data. When I’m asked to peer review a paper, it’s quite common that I'll find an error in the data analysis methodology that leads to an incorrect result. I like to think I catch those when I review papers, but it's pretty common, and certainly lots of mistakes are not caught. There are real concerns about the reproducibility of findings for that reason. An additional concern is not necessarily that data analysis is done incorrectly, but rather that there's bias in the assessment of methods, because the same researcher is developing a method and then trying to show that it performs better than other methods.

Every paper will report by some metric that their method's better than the others. And of course every method can't be the best one. One way around that, which I've been passionate about, is to set up crowdsourced challenges and benchmarking-type competitions, working through the DREAM Challenges mechanism for example. That separates the process of developing a method from assessing the method by establishing predetermined objectives and benchmarks that are blind to the people developing the method, and then asking them to solve an analysis problem and submit their best approach for analyzing the data. That alleviates a lot of the issues that we see in self-assessments and in self-reporting that each method is better than every other method.
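The separation between developing a method and assessing it can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how the DREAM Challenges are actually run: the hidden labels, team names, and choice of accuracy as the metric are all hypothetical.

```python
# Hypothetical sketch: data, team names, and metric are invented for
# illustration and do not describe any real challenge infrastructure.

def accuracy(predictions, truth):
    """Predetermined scoring metric, fixed before any submission is seen."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


# Held by the organizers only; method developers never see these labels.
HIDDEN_TRUTH = [1, 0, 1, 1, 0, 0, 1, 0]


def score_submissions(submissions):
    """Score every team against the same hidden labels and rank them."""
    scores = {}
    for team, preds in submissions.items():
        if len(preds) != len(HIDDEN_TRUTH):
            raise ValueError(f"{team}: expected {len(HIDDEN_TRUTH)} predictions")
        scores[team] = accuracy(preds, HIDDEN_TRUTH)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


submissions = {
    "team_a": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "team_b": [1, 1, 1, 1, 0, 1, 1, 1],  # 5 of 8 correct
}
leaderboard = score_submissions(submissions)
```

The point of the design is that neither team chooses the metric or sees the labels, so no one can tune the evaluation to favor their own method.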