Takeda On Foundational, Business-Driven, Enabling Data Science

October 15, 2018 | At the 2018 Bio-IT World Conference & Expo, we dedicated the Wednesday morning plenary session to a panel discussion on data science. It was a rich conversation. Panelists laid out their own definitions of “data science”, discussed how to build and structure a data science team, and debated what role expertise (data science or domain) should play in experimental design. As we talked, we collected questions from the audience, more than we were able to use. Instead of abandoning the community’s contributions, I’ve collected the top-voted questions from the panel and offered them back to our community. Want to weigh in? Download the question list and join the conversation. Allison Proffitt, Editor

Krista McKee

Director, Data & Analytics

On behalf of Takeda Data Sciences Institute’s Data Architecture and Technology Team

Takeda Data Sciences Institute


Bio-IT World: What is the optimal environment to grow data science?

Krista McKee: Data science should be foundational, business-driven, and enabling within an organization. Foundational in that it should ensure an organization has efficient and appropriate access to the right data, both internal and external. Business-driven in that it should be acutely aware of the business needs and positioned to define and deliver innovative, fit-for-purpose data strategies to address business questions. And enabling in that it should push an organization beyond its traditional ways of working to a more efficient, data-driven future.

What are the different levels of maturity in data science in a lab and what does the roadmap look like?

Data science is an old concept that has become a new discipline. I like to think of this growing branch of learning through my understanding/interpretation of psychologist Richard Barrett’s Ego-Soul Dynamic.

The nascent data science group is in the stage of Ego Development. They are trying to figure out the world in which they live, or the organization in which they operate. They are, at minimum, surviving while also working to establish their utility within the organization.

The more mature data science group is in the stage of Ego-Soul Alignment. This is when a group is less dependent on the surrounding organizational culture and the resulting feedback loop. Rather, it brings an independent purpose-driven vision and accepts broad responsibility and accountability.

The most mature data science group is in the stage of Soul Emergence. This is when a group is maximally positioned to broadly collaborate within and serve an organization, leading to large, positive outcomes.

For any organization, the data science roadmap should be practical and immediately helpful while also being bold and visionary.

How do you generalize processes for data science and analytics when working with heterogeneous (and always evolving) data types?

Within Takeda R&D, we have created a data quality and governance playbook to support our overall Data Sciences Strategy/Plan. Through the socialization/implementation of this playbook, we expect to:

Foster a data-driven culture,
Enable common understanding of how R&D data supports business needs,
Establish clearly defined data governance policies,
Define ownership and accountability by Data Domains, and
Provide a scalable framework for future data needs.

We have also been evolving towards a lean data lake infrastructure that allows for more flexible data ingestion processes and data consumption methods beyond the schema-on-read ELT approach. The lean approach uses AWS S3 instead of HDFS as the foundation for the data lake to accommodate different data types and their constantly evolving nature.
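To make the schema-on-read idea mentioned above concrete, here is a minimal sketch. It is not Takeda's implementation: the `object_store` dictionary is a hypothetical in-memory stand-in for an object store such as S3, and the key names, field names, and the `ingest` and `read_records` helpers are all assumptions made for illustration. The point is only that raw files of different types land untouched, and a schema is applied when the data is read.

```python
import csv
import io
import json

# Hypothetical in-memory stand-in for an object store such as S3:
# keys map to raw bytes, stored exactly as received.
object_store = {}


def ingest(key, raw_bytes):
    """Land the raw payload untouched; no schema is enforced at write time."""
    object_store[key] = raw_bytes


def read_records(prefix, schema):
    """Apply a schema at read time to every readable object under a key prefix.

    `schema` maps an output field name to a (source field, converter) pair.
    """
    records = []
    for key in sorted(object_store):
        if not key.startswith(prefix):
            continue
        if key.endswith(".json"):
            rows = json.loads(object_store[key].decode("utf-8"))
        elif key.endswith(".csv"):
            text = object_store[key].decode("utf-8")
            rows = list(csv.DictReader(io.StringIO(text)))
        else:
            continue  # other formats stay in the lake, simply unread here
        for row in rows:
            records.append(
                {out: convert(row[src]) for out, (src, convert) in schema.items()}
            )
    return records


# Illustrative payloads of different types, all landed as-is.
ingest("assays/2018/plate1.csv", b"sample_id,ic50_nm\nS1,12.5\nS2,40\n")
ingest("assays/2018/plate2.json", b'[{"sample_id": "S3", "ic50_nm": 7}]')
ingest("images/scan.tiff", b"\x00\x01")

# The schema lives with the consumer, not with the stored data.
schema = {"sample": ("sample_id", str), "ic50_nm": ("ic50_nm", float)}
assay_records = read_records("assays/", schema)
```

In this sketch, a new data type can be ingested without changing anything already stored; only the reader needs to learn how to interpret it.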

What skill sets would you recommend for graduate student researchers who are transitioning into the data science industry? Does domain expertise still matter in the era of data science, ML, and AI?

Data science requires a combination of biomedical/industry understanding, big data engineering know-how, and mathematical/statistical training. In today’s biopharma industry, I see data science as the necessary balance on the scale. Domain expertise will always be important, but it is relied upon too heavily. Technological capabilities and data will continue to grow exponentially, and it is up to data scientists to work with the domain subject matter experts to more broadly utilize data to bring greater business insight and, ultimately, create more efficient organizations.

How do you define data curation?

Data curation is the thoughtful and deliberate management of data, including cleansing, transformation, and proper annotation/cataloguing. It ensures that data can be effectively consumed and integrated for business insight and organizational decision-making.
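As a rough illustration of the curation steps named in this definition (cleansing, transformation, annotation, and cataloguing), here is a small sketch. Every function, field name, and rule in it is hypothetical and chosen only to show how the steps might chain together; real curation rules depend on the data domain.

```python
def cleanse(record):
    """Trim whitespace and drop fields that are empty after trimming."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[field] = value
    return cleaned


def transform(record):
    """Normalize weight to kilograms (hypothetical rule: source is in grams)."""
    out = dict(record)
    if "weight_g" in out:
        out["weight_kg"] = float(out.pop("weight_g")) / 1000
    return out


def annotate(record, source):
    """Attach provenance and unit metadata so the record is self-describing."""
    units = {"weight_kg": "kg"} if "weight_kg" in record else {}
    return {"data": record, "metadata": {"source": source, "units": units}}


# Hypothetical catalogue: a lookup from record ID to the curated record.
catalogue = {}


def curate(record_id, raw, source):
    """Run cleanse -> transform -> annotate, then register the result."""
    curated = annotate(transform(cleanse(raw)), source)
    catalogue[record_id] = curated
    return curated


result = curate(
    "R1",
    {"subject": "  A-001 ", "weight_g": "72500", "notes": "  "},
    "site_7_export",
)
```

Here the blank `notes` field is dropped, the weight is converted, and the curated record carries its own source and units, which is what lets downstream consumers integrate it without guessing.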

How big is the problem of messy data (missing metadata, data that is not FAIR) as a roadblock for data science? Can you discuss tricks and tips to streamline the data cleaning part of data sciences?

It is a problem, and it is exactly why data sciences needs to be foundational to an organization. On their own, organizations will continue to think of data in functional silos. Data science brings the horizontal perspective to an organization’s data. A data sciences group must be persistent in its foundational aspects and work to establish a data culture. Machine learning can also help with the messy data problem. We need to keep getting better at using machine learning algorithms and applications to answer important questions for our organizations.

ROI and economics are not to be forgotten when talking about data science. How do you calculate ROI of your data science efforts?

For each data sciences project, we create an impact statement. The impact statement articulates new capabilities, new learnings/awareness, efficiency gains, and cost savings for the organization.

Are there areas which you feel are ripe to apply data science tools which have not embraced them much yet?

Clinical Development Plans/Clinical Trials. I would like to see data science strategies and digital tools more broadly and systematically utilized in the planning and execution of today’s clinical development programs. We also need to continue to make progress on more experimental technologies, such as sensors, that don’t yet have a firm place in human experimentation but have tremendous potential.

Have you had any experience with data science methods being misused because the implementer did not fully understand the methodology?

I don’t have a specific example, but this is an important consideration. Data science methods must be thoughtfully and appropriately applied to any business problem. This is impossible for a data scientist to do in a vacuum. For any project, there must be considerable interaction with the relevant subject matter experts to ensure the questions are clearly articulated, the data sources and methodologies are appropriately selected, and the limitations of both the data and the methodologies are understood. Overall, data science is a very dynamic field, and the biopharma industry community should continue working to make the most of its potential.