Looking Forward: How Data Science Shapes Future Work

July 25, 2018

At the 2018 Bio-IT World Conference & Expo, we dedicated the Wednesday morning plenary session to a panel discussion on data science. It was a rich conversation. Panelists laid out their own definitions of “data science”, discussed how to build and structure a data science team, and debated what role expertise (data science or domain) should play in experimental design. As we talked, we collected questions from the audience, more than we were able to use. Instead of abandoning the community’s contributions, I’ve collected the top-voted questions from the panel and offered them back to our community. Want to weigh in? Download the question list and join the conversation. A.P.


Gaurav Kaushik

Associate Director, Data Products & Strategy at Foundation Medicine, Inc.

What is the optimal environment to grow data science?

Gaurav Kaushik: If you want to grow data science capabilities at your company or in your team, I’ve found that the best way to do this is to directly demonstrate the value it can bring to the organization. This can mean articulating concrete use cases and business cases and, if you have the space and ability, prototyping solutions and demoing them. Having a tangible demonstration of the value of data science can help build consensus and support. If you’re a junior data scientist and you want to improve data science at your company, find a sponsor or a mentor. Ideally, she values your goals for personal growth and development, is receptive to new ideas, and can encourage you to think through problems.

What are the different levels of maturity in data science in a lab and what does the roadmap look like?

For me, data science maturity is related to the persistence and impact of your data science efforts, not necessarily to building towards predictive models. You can have a team that’s great at building neural nets but that continues to spin its wheels when trying to get anything done.

The lowest maturity level might look like this: teams scrambling to collect data, analyze it, ship the analysis, and move on to the next thing, without taking the time to develop tools for working with the data, document analyses, craft and tell stories about the work, or pause to think about smarter solutions.

A high level of maturity, to me, is when a team is actively learning from previous experiences, creating persistent, reusable research objects, democratizing knowledge, continuously incorporating new ideas and lessons, and able to take on more work without burning out.

When an organization has a data product team, that can be a great sign of data science maturity—theoretically, there’s a stable data pipeline, a team that is being thoughtful about which use cases to pursue, perhaps even designers and UX experts helping to create human-friendly solutions, and champions for turning data into value more broadly.

How do you generalize processes for data science and analytics when working with heterogeneous (and always evolving) data types?

This can be tough. I try not to repeat myself too much when it comes to working with data: if I do something twice, I write code to automate it the third time. I talk to and work with software engineers as much as possible, and try to learn good coding practices from them. Reusable, well-documented code is helpful no matter what data you’re looking at. Ditto version control. Visualization is also essential for understanding data, so pick a library and get comfortable with it. Also: Pandas.
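As a rough sketch of that habit (not from the panel discussion), here is what a small, reusable Pandas helper might look like; the file and column layout are hypothetical:

```python
import pandas as pd

def summarize_dataset(path: str) -> pd.DataFrame:
    """First-look summary of a CSV: column types, missingness, cardinality.

    Written once and reused on every new extract instead of retyping the
    same exploratory commands.
    """
    df = pd.read_csv(path)
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),             # column types
        "missing_frac": df.isna().mean().round(3),  # fraction of missing values
        "n_unique": df.nunique(),                   # distinct values per column
    })

# summarize_dataset("samples.csv")  # hypothetical file
```

Writing the summary once means every new extract gets the same documented first look.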

What skill sets would you recommend graduate student researchers have, who are in the process of transitioning to a Data Science industry? Does domain expertise still matter in the era of data science, ML, and AI?

Domain expertise matters, especially in biological data science. Biology is incredibly complex (spoiler alert) and it can be challenging to discriminate between signal and loud noise. Domain expertise also helps differentiate you and your work in a world full of Kagglers.

There’s a story I like to tell about a colleague in graduate school. He got really interested in high-throughput proteomics and wanted to characterize some metastatic cancer cells. As a chemist, he was new to biology but eager to learn. One day, he arrived at lab meeting to excitedly present the results of his experiments. He proudly declared that he had found a protein that was overexpressed in metastatic cancer cells and was eager to publish an abstract on it. That protein was keratin, which is found in human skin, hair, etc. He had contaminated his samples.

The takeaway is this: domain expertise can help prevent garbage-in, garbage-out, especially in a field like biology, where the risk of error in digitizing data is high.

How do you define data curation?

Data curation means collecting not just data, but also metadata, documentation, and whatever else you need to understand how the data was recorded/digitized. It also means diving into the data and cleaning it. Ideally, you produce code or a recipe for how to recreate the dataset for others.
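For illustration, a curation “recipe” might be a single script that rebuilds the dataset from the raw export; the file names, columns, and cleaning rules below are hypothetical stand-ins:

```python
import json
import pandas as pd

RAW = "raw_measurements.csv"          # hypothetical raw export
CURATED = "curated_measurements.csv"  # the dataset others will consume

def curate() -> pd.DataFrame:
    """Rebuild the curated dataset from the raw export, end to end.

    Every cleaning decision lives in this one place, so anyone can rerun
    the recipe and get the same dataset.
    """
    df = pd.read_csv(RAW)
    df = df.drop_duplicates()                                # remove repeated records
    df["collected_on"] = pd.to_datetime(df["collected_on"])  # normalize date formats
    df = df[df["value"] >= 0]                                # drop impossible readings
    return df

if __name__ == "__main__":
    curated = curate()
    curated.to_csv(CURATED, index=False)
    # Record metadata alongside the data so others know how it was made.
    meta = {"source": RAW, "rows": len(curated), "columns": list(curated.columns)}
    with open(CURATED + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
```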

How big is the problem of messy data (missing metadata, data that is not FAIR) as a roadblock for data science? Can you discuss tricks and tips to streamline the data cleaning part of data sciences?

It’s the biggest problem. Data analysis software and algorithms have never been more abundant or readily available, but getting harmonized, clean data is still a challenge. One reason is that data are often generated without concern for the standards or practices that enable reintegration or interoperability with other data. And some organizations never planned to become a ‘data company,’ so their infrastructure is not set up for data science.

Regarding data cleaning, my advice is to build a toolbox with colleagues. If you’re working with a stream of data from your organization, invest time in creating tools to explore and do basic processing of that data rapidly. Share your tools with others. If possible, bubble critical operations upstream, so the pipeline moves ‘cleaner’ data downstream and everyone below has less cleaning to do.
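A minimal sketch of such a shared toolbox, assuming hypothetical column names and a Pandas-based pipeline:

```python
import pandas as pd

def standardize_ids(df: pd.DataFrame, col: str = "sample_id") -> pd.DataFrame:
    """Normalize identifier formatting once, upstream, for every consumer."""
    out = df.copy()
    out[col] = out[col].str.strip().str.upper()
    return out

def require_columns(df: pd.DataFrame, required: list) -> pd.DataFrame:
    """Fail fast at the top of the pipeline instead of deep inside an analysis."""
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return df

# Downstream analyses then start from already-clean data:
# df = require_columns(standardize_ids(pd.read_csv("assay_results.csv")),
#                      ["sample_id", "assay", "value"])
```

Running checks like `require_columns` at the head of the pipeline is one concrete way of bubbling critical operations upstream.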

ROI and economics are not to be forgotten when talking about data science. How do you calculate the ROI of your data science efforts?

The activities with the highest ROI for me are those that help set us up for future success or teach us something incredibly valuable. This is, however, not how all organizations see ROI. Often, the perceived contribution of a team is tied directly to revenue, though short-term revenue generation does not necessarily lead to long-term benefits. One approach I’ve used is to deliver a tangible ROI (revenue) and then advocate for activities that will yield longer-term ROI (such as hiring more data scientists, new types of data science activities, and better technology resources).

Calculating ROI is important, but perhaps more important is communicating it. If your team is successful, you have to take the time to communicate the story of that success to key decision makers in your organization. Tell the story of how you solved a problem, or helped bring in a deal, and advocate for what else you could do with more resources for longer-term ROI.

Are there areas which you feel are ripe to apply data science tools which have not embraced them much yet?

I wish we had better data products for information discovery. I can get an endless stream of recommendations for things to read via Pocket, Reddit, etc. But recommendation engines for peer-reviewed publications leave me wanting. Specifically, I’m not interested in a tool that spits out papers based on what I’ve been reading. Rather, I’d love a tool that zoomed out and gave information on trends in science, relationships between disciplines, emerging technologies, and disciplines that have little co-mingling and are perhaps worth exploring. I want more tools to better learn about human knowledge and reveal opportunities for further exploration. Know any?

Have you had any experience with data science methods being misused because the implementer did not fully understand the methodology?

Misuse of methods happens in every scientific field (see: p-values), and data science is no exception. I see a lot of arbitrary model and hyperparameter (model “settings”) selection. One place where this can be especially problematic is in unsupervised learning, where, to put it simply, you’re trying to identify unique subgroups in your data. I’ve seen descriptions of methods that say “we selected the first 3 principal components” or “we chose 3 basis vectors”, but not why, or why that number is meaningful or relevant to the underlying data. A tremendous amount of value in data science comes from being able to uncover and interpret signals that are latent within the data. When we force-feed data into models, we do a disservice to ourselves.
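As one illustration of the alternative, a component count can be justified by explained variance rather than chosen by fiat. This sketch uses scikit-learn’s PCA with a hypothetical 90% threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest number of principal components explaining at least
    `threshold` of the total variance, instead of an arbitrary 'first 3'."""
    pca = PCA().fit(X)  # fit all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Example with random data as a stand-in for a real matrix:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
print(n_components_for_variance(X))
```

The point is not the specific threshold; it is that the number of components becomes a documented, defensible consequence of the data rather than a magic constant.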