Denial And Data Science: How You Must Change Your Thinking To Work In A Data Driven Environment

August 3, 2018

At the 2018 Bio-IT World Conference & Expo, we dedicated the Wednesday morning plenary session to a panel discussion on data science. It was a rich conversation. Panelists laid out their own definitions of “data science”, discussed how to build and structure a data science team, and debated what role expertise (data science or domain) should play in experimental design. As we talked, we collected questions from the audience—more than we were able to use. Instead of abandoning the community’s contributions, I’ve collected the top-voted questions from the panel and offered them back to our community. Want to weigh in? Download the question list and join the conversation. A.P.


Samantha G. Zeitlin, PhD

Lead Data Scientist, Denali Studios

Bio-IT World: What is the optimal environment to grow data science?

Samantha G. Zeitlin: Step 1. Be in a data-driven environment. It's a lot harder to grow data science if management thinks everything should be done on “gut instinct” alone, or “the way it's always been done”. The first thing everyone learns when you start using more data is that some long-standing assumptions are just plain wrong. And then you can choose to be in denial, or you can choose to change your thinking. All the data in the world won’t help if there’s not a culture to support data-driven progress.

Step 2. Have the infrastructure to support data. Ask these questions: Are you logging events? Can you send data from distributed equipment to some central source? Can you easily join disparate data sources together? Is your data structured in a way that makes it easy to work with? If you’re missing any of these things, your data scientists have to be data engineers first, to get the data into a shape where it's usable.

Step 3. Have a close relationship, built on trust, between the data scientists and the subject-matter experts (whether that’s the bench scientists or the product owners). Have data scientists who know enough engineering to understand the trade-offs between ideas and implementation, and who have enough communication skills to find out what’s really needed and what’s really doable.
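One of the infrastructure questions in Step 2 (can you easily join disparate data sources together?) can be sketched in a few lines of Python. Everything here is hypothetical: the field names (instrument_id, temperature, location) and records are invented purely for illustration, and real schemas will differ.

```python
# Hypothetical sketch: left-joining event records from distributed
# equipment with instrument metadata from a second source, by shared id.

events = [
    {"instrument_id": "A1", "temperature": 37.0},
    {"instrument_id": "B2", "temperature": 25.5},
    {"instrument_id": "C3", "temperature": 4.0},
]

instruments = {
    "A1": {"location": "Lab 1"},
    "B2": {"location": "Lab 2"},
    # C3 is missing: disparate sources rarely line up perfectly
}

def join_on_instrument(events, instruments):
    """Left-join event records with instrument metadata by id;
    events with no matching metadata pass through unchanged."""
    joined = []
    for event in events:
        meta = instruments.get(event["instrument_id"], {})
        joined.append({**event, **meta})
    return joined

result = join_on_instrument(events, instruments)
```

If joins like this require one-off scripts every time, that's a sign the data scientists are doing data engineering first, as described above.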

What are the different levels of maturity in data science in a lab and what does the roadmap look like?

In the early stages, you’re doing everything ad-hoc, with small data samples, or it’s just very slow. One person can do this. But it’s hard to automate, it’s hard to re-use anything, it’s hard to build tools, because you just don’t have time to do everything yourself. You have to rely on proprietary software, which limits your flexibility.

In the next stage, you have some documentation, like data dictionaries. You have more than one person solving problems, so you can build infrastructure and also handle ad-hoc questions as they arise. You still have way too much to do, so you spend a lot of energy prioritizing and re-prioritizing, and you do that by clarifying what’s required for publication/grant funding/revenue impact, and weighing the long-term and short-term consequences of your design choices.

In the most mature situation, you have automation to handle the repetitive tasks. It’s easy to get the data you need, so you can spend more time thinking about innovative solutions to difficult problems. You can put new models into regular use (ideally, into production) and iterate on them.

How do you generalize processes for data science and analytics when working with heterogeneous (and always evolving) data types?

You’ve got two choices, basically. One, you have separate sets of tools that work with different kinds of data, and you optimize them as much as you can. Two, you build the tooling to funnel data from one shape (say, unstructured JSON) to another (denormalized tabular).

In reality, of course you’re going to need both. Personally, I try to make my tools as reusable as possible. I try to keep everything DRY (don’t repeat yourself). Whenever possible I try to consolidate and join data together to get the whole picture of things that are contemporaneous, but sometimes that’s just not feasible.
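The second option above (funneling unstructured JSON into a denormalized table) can be sketched with the standard library alone. The record structure here is hypothetical, invented for illustration; the point is the shape change: one nested record in, one flat row per nested item out, with parent fields repeated.

```python
import json

# Hypothetical nested record, as it might arrive from an instrument or API.
raw = json.loads("""
{
  "sample": "S-001",
  "assays": [
    {"name": "pH", "value": 7.2},
    {"name": "conc", "value": 0.5}
  ]
}
""")

def denormalize(record):
    """Emit one flat (tabular) row per nested assay,
    repeating the parent fields on every row."""
    rows = []
    for assay in record["assays"]:
        rows.append({
            "sample": record["sample"],
            "assay": assay["name"],
            "value": assay["value"],
        })
    return rows

rows = denormalize(raw)
```

Once data is in this denormalized shape, it joins easily with other tabular sources, which is what makes the funneling worth the effort.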

What skill sets would you recommend for graduate student researchers who are in the process of transitioning to industry data science? Does domain expertise still matter in the era of data science, ML, and AI?

Domain expertise helps, but not as much as the ability to learn new things. Be good at googling. Seriously. I can’t emphasize that enough. Be good at taking notes and asking questions.

Some skills that will help you in data science will help in lots of other jobs. Work on your writing and presentation skills. Understand what makes a good slide deck persuasive to a non-technical audience.

Having engineering skills will always be useful. Learn a programming language that can be used in production. Pick something like Python, Node, Go, or Rust; something that is not just Excel or R. Learn some object-oriented concepts, learn how to write unit tests, and get comfortable with GitHub.
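As a minimal sketch of the object-oriented and unit-testing habits mentioned above: a small class with plain assert-style tests. The class, numbers, and unit conversion are hypothetical, chosen only to show the pattern of pinning down both the normal behavior and an edge case.

```python
class ConcentrationConverter:
    """Toy example class: convert ng/uL to nM given a molecular weight.
    Entirely hypothetical, for illustrating unit-test structure."""

    def __init__(self, mol_weight_g_per_mol):
        if mol_weight_g_per_mol <= 0:
            raise ValueError("molecular weight must be positive")
        self.mw = mol_weight_g_per_mol

    def ng_per_ul_to_nm(self, ng_per_ul):
        # 1 ng/uL == 1 mg/L == 1e-3 g/L; divide by MW (g/mol) to get mol/L,
        # then scale by 1e9 to get nmol/L.
        return ng_per_ul * 1e6 / self.mw


# Unit tests pin down the behavior you rely on, including edge cases.
def test_conversion():
    conv = ConcentrationConverter(1e6)
    assert abs(conv.ng_per_ul_to_nm(1.0) - 1.0) < 1e-9

def test_rejects_bad_weight():
    try:
        ConcentrationConverter(0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Tests like these are what let you refactor later without fear, which matters once code moves toward production.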

Try to develop a sense of statistical intuition. Don’t get buried in the details. Memorizing the names of algorithms will help with interviews, but it’s a lot more useful to practice applying one or two to real-world situations, and have a portfolio project or two from beginning to end: data collection, data cleaning, analysis, visualization, presentation, ideas for next steps.

How do you define data curation?

To me, this can cover everything from data cleaning to data validation and documentation. It’s really the foundation of everything we do. It’s usually really tedious, and it’s always absolutely essential. If you don’t do it, you’re building a house of cards.

How big is the problem of messy data (missing metadata, data that is not FAIR) as a roadblock for data science? Can you discuss tricks and tips to streamline the data cleaning part of data sciences?

Messy data is the real world. There’s always missing data; there’s always stuff that’s just wrong. It’s not usually a roadblock since you can almost always find something to work with.

Tips and tricks to streamline data cleaning? No, I won’t even try to summarize that here. It’s not a tips-and-tricks thing; it’s a whole blogpost worth of stuff, and it takes a lot of experience to be really good at it. There are plenty of easy choices you can make: drop the rows where the critical values are obviously wrong, standardize data types in each column, stuff like that. I’d say the most useful hint I can give is: try doing some exploratory analysis, do some visualizations, and ask yourself if the results make sense. Anything anomalous might be interesting, or it might be because the data is still in bad shape. Always be skeptical.
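The easy choices mentioned above (dropping rows where critical values are obviously wrong, standardizing types in each column, then sanity-checking the result) might look like this in plain Python. The pH field, the sample data, and the validity thresholds are all invented for illustration.

```python
# Hypothetical raw rows: the critical value (ph) arrives as strings,
# some unparseable, some physically impossible.
raw_rows = [
    {"sample_id": "S1", "ph": "7.2"},
    {"sample_id": "S2", "ph": "-3.0"},   # pH below 0: obviously wrong
    {"sample_id": "S3", "ph": "n/a"},    # unparseable
    {"sample_id": "S4", "ph": "6.9"},
]

def clean(rows):
    cleaned = []
    for row in rows:
        try:
            ph = float(row["ph"])      # standardize type: string -> float
        except ValueError:
            continue                   # drop unparseable values
        if not 0.0 <= ph <= 14.0:      # drop physically impossible values
            continue
        cleaned.append({"sample_id": row["sample_id"], "ph": ph})
    return cleaned

rows = clean(raw_rows)

# Quick sanity check, per the hint above: does the result make sense?
mean_ph = sum(r["ph"] for r in rows) / len(rows)
```

The final summary statistic is the "does this make sense?" step: if the mean came out wildly off, the data is probably still in bad shape.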

ROI and economics are not to be forgotten when talking about data science. How do you calculate the ROI of your data science efforts?

It depends. For some things, it’s really obvious. I can tell you one of the features I built at Yahoo was saving us >$100k per day. It definitely helps if you can calculate the actual revenue impact on a product that’s actually making money.

For other things, who knows. I did a lot of data science in my research career, although no one was calling it that at the time. I took a lot of risks doing what I thought was the most important work I could find, and it didn’t pay off for me personally. Ultimately, my work did help lead to some new cancer drugs, among other things. That made some money for people other than me, but I did the work because I wanted to help save lives. How do you put a price on that?

Are there areas which you feel are ripe to apply data science tools which have not embraced them much yet?

Not really. I’m seeing adoption in most fields, actually; it just varies in degree by geography and funding source. But there’s data everywhere. It’s just a matter of time until everyone starts using it more than we already are.

Have you had any experience with data science methods being misused because the implementer did not fully understand the methodology?

Yes. I won’t give examples because even if I didn’t name names, it would be too obvious. I’ve also seen plenty of published papers using questionable methods for data cleaning, data analysis, and statistical interpretation.

Interestingly, when I was in academia we were led to believe that industry tended to play fast and loose with data, while academic labs were always very careful. So I was somewhat surprised to learn that when there’s real money attached and really immediate consequences to the business, there’s perhaps even more incentive to validate answers and fix problems quickly.