Automated Data Cleaning and AI Pipelines for Increased Enterprise-Wide Adoption

April 13, 2022

Contributed Commentary by Will Bowers, Dotmatics

The current data-rich era comes with a tremendous number of caveats. Large quantities of data are not necessarily high quality. Quality, in this case, describes the ease with which a user can obtain useful insights, and it can be reduced by missing values, unnecessary repetition of samples, lack of standardization, and more. When quality is low, users must spend a great deal of time and effort to extract whatever insights a dataset does hold. "Garbage in, garbage out" is such a cliché by this point that it has earned its own acronym, the GIGO principle. In this article, I hope to shed some light on practices within a business that can maintain data integrity, preparing it for automated AI pipelines among other uses.

Cleaning Data

Estimates of the time spent cleaning data when performing machine learning vary widely, between 50% and 80%, depending on the role of the individual and the source of the data. This could be for any number of reasons, such as transforming data from an instrument into a tidy, more human-readable format, or cleaning an older dataset to be recycled for new insights. Whatever the reason, I'm sure we can agree that though important, data cleaning is tedious, often seemingly endless work.

To win some time back, we can automate the cleaning steps, beginning with how we collect the data. Perhaps we have a number of well-defined experiment types in our lab or institution. Then it makes sense to set up a data structure, be it a semantic-, graph-, or table-based approach, and collect the data in a consistent manner. If the users have historically preferred Excel, provide ways to upload and organize the contents of the spreadsheet to a central table. Alternatively, skip that step by allowing users to enter observations into the database via webforms. Over time, the format of the data will remain consistent, removing the need for a cleaning step.
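As a minimal sketch of that automated cleaning step, the snippet below normalizes uploaded spreadsheet contents into a consistent central-table format: headers are mapped to a canonical schema, exact duplicate samples are dropped, and rows with missing values are flagged. The column names and schema here are hypothetical, not a real laboratory data model.

```python
import csv
import io

# Hypothetical canonical schema for one well-defined experiment type;
# real column names would come from your own data structure.
CANONICAL = {"sample id": "sample_id", "compound": "compound_id",
             "ic50 (nm)": "ic50_nm"}

def clean_rows(csv_text):
    """Normalize headers, drop exact duplicate samples, flag missing values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, cleaned = set(), []
    for row in reader:
        # Map each spreadsheet header onto the canonical column name.
        record = {CANONICAL.get(k.strip().lower(), k.strip().lower()): v.strip()
                  for k, v in row.items()}
        key = tuple(sorted(record.items()))
        if key in seen:  # unnecessary repetition of samples
            continue
        seen.add(key)
        record["complete"] = all(record.values())  # flag missing entries
        cleaned.append(record)
    return cleaned
```

Running this on every upload keeps the central table in one consistent format, which is what removes the manual cleaning step over time.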

As the organized, highly searchable data piles up, it becomes simple to run data visualizations for quick-reference dashboards or more in-depth insights. But more than this, data organized from the outset allows for the addition of Application Programming Interfaces (APIs), enabling easy transfer of data between discrete components. Modular, dynamic components are a cornerstone of good software carpentry.
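To illustrate that transfer between discrete components, here is a sketch of a typed record serialized into a versioned JSON payload that one component could expose over an API and another could consume. The record fields are illustrative only, not a real vendor schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record shape; field names are illustrative.
@dataclass
class AssayResult:
    sample_id: str
    compound_id: str
    ic50_nm: float

def to_api_payload(results):
    """Serialize records for transfer between components via a REST-style API."""
    return json.dumps({"version": 1,
                       "results": [asdict(r) for r in results]})

def from_api_payload(payload):
    """The receiving component reconstructs the same typed records."""
    return [AssayResult(**r) for r in json.loads(payload)["results"]]
```

Because both ends agree on the payload shape, either component can be swapped out independently, which is the modularity the text describes.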

Arguably more crucial, organized data is more searchable and AI-ready. More than this, it’s high-throughput AI-ready.

The Flywheel of Research

Many models have been presented for the broad steps of discovery, but for the sake of simplicity, I'll be using the make-test-decide (MTD) cycle. Each time a researcher cycles through these steps, new avenues are revealed, but some are also shown to be fruitless. With each additional cycle, the number of avenues grows rapidly. Ideally, the researcher should be shown an indication of the likely success of a compound before synthesis and testing, reducing the cost and time spent on poor leads.
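The "decide" step above can be sketched as a simple pre-synthesis triage: candidates are ranked by a predicted success score and poor leads are pruned before any resources are committed. The `predict_success` function here is a hypothetical stand-in for a real trained model.

```python
def predict_success(candidate):
    # Hypothetical: in practice this would be a trained model's
    # predicted probability of meeting the assay criteria.
    return candidate["score"]

def decide(candidates, threshold=0.5, top_n=3):
    """Keep only the most promising leads for the next 'make' step."""
    promising = [c for c in candidates if predict_success(c) >= threshold]
    promising.sort(key=predict_success, reverse=True)
    return promising[:top_n]
```

Every compound pruned here is synthesis and assay time returned to the researcher, which is what keeps the flywheel spinning.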

It’s useful to have colleagues with an intuitive understanding of AI approaches, but as with many skills and knowledge areas, theoretical understanding doesn’t always transfer to practical application. The more parameters you give an experienced user, the more they can adjust them to their needs, whereas a less experienced user given the same options may create a model that predicts no better than a coin toss.

By taking a similar approach in the chemical space, one could train a model on all the compounds examined over a long project. Let’s say your division is researching how to improve a minor allergy medication for a general market. Cell plate assays may focus on low toxicity, percent localized cytokine inhibition, and rate of action. You could be tagging compounds for your own future reference, but that tagging also shows a training algorithm which compounds fall in the acceptable assay range. Over time, you build a highly effective model that learns the ideal functional groups, scaffolds, and general size of the most effective compounds.
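The tagging step can be sketched as a labeling rule over the assay results: a compound is tagged a hit only if every assay falls in the acceptable range, and those tags become the training targets. The assay names and thresholds below are illustrative, not real acceptance criteria.

```python
# Hypothetical acceptable ranges for the allergy-medication example:
# low toxicity, high localized cytokine inhibition, fast onset.
ACCEPTABLE = {"toxicity": lambda v: v < 0.1,
              "cytokine_inhibition_pct": lambda v: v >= 70.0,
              "onset_minutes": lambda v: v <= 30.0}

def label_compound(assays):
    """Tag a compound 1 (hit) if every assay falls in range, else 0.

    These tags double as the training labels for the model."""
    return int(all(check(assays[name]) for name, check in ACCEPTABLE.items()))
```

The point is that a tag applied once, for human reference, is reused for free as supervision every time the model retrains.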

We must certainly consider “functional” AI; that is, AI which provides valuable insights without conscious user input. Like Netflix recommending your next favorite TV series based on previously viewed content, the model can scour all purchasable compounds and return a series of candidates with similar predicted outcomes to prior compounds in the project, thus spinning the flywheel faster. Because these compounds are only recommended if they are purchasable now, you can put in orders for the new candidates right away.
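One common way to implement that recommendation step is Tanimoto similarity over structural fingerprints; the sketch below models fingerprints as sets of feature keys and ranks a purchasable catalog by best similarity to any known project hit. The catalog names and fingerprints are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(hits, catalog, top_n=2):
    """Rank purchasable compounds by best similarity to any known hit."""
    scored = [(max(tanimoto(fp, h) for h in hits), name)
              for name, fp in catalog.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]
```

Restricting the catalog to currently purchasable compounds is what turns a prediction into an order that can be placed the same day.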

This leaves space for experienced and less experienced researchers alike to think, design, innovate, and pursue lucrative areas at a drastically greater rate. The more of these leads that are assayed, the more the model learns, so its next suggestions will be even more likely to meet your needs.

Tracking and Consistency

In industry, there are business rules to meet. Across scientific disciplines, consistency allows comparison. In R&D we can’t have the shadowy blends of autonomously adjusting algorithms and human tinkering that we see in social and entertainment media; researchers need traceable, specific models and predictions.

So, users require dashboards that display modelling performance over time, tracking hundreds or thousands of models and aggregating them so the pattern of change is clear. An honest and transparent system is vital here too. Which less-effective models need to be removed from production? Which regions of assayed molecules influence the model or models most? What structure makes a recommended compound suitable, and to what degree of certainty?
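The first of those questions can be sketched as an aggregation over each model's performance history: average the most recent scores and flag any model that has drifted below a floor, so less-effective models can be pulled from production. The metric, window, and threshold here are illustrative choices.

```python
from statistics import mean

def flag_for_removal(history, floor=0.7, window=3):
    """history maps model id -> chronological list of accuracy scores.

    Returns the ids whose recent average has fallen below the floor."""
    flagged = []
    for model_id, scores in history.items():
        if mean(scores[-window:]) < floor:
            flagged.append(model_id)
    return sorted(flagged)
```

Aggregating over a recent window rather than a single run keeps one noisy evaluation from prematurely retiring an otherwise healthy model.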

To empower researchers, we must not expect users to blindly follow, but enable them to find answers to any questions they may have with very few button presses, while giving subject matter experts access to make adjustments that improve the overall outcome.


Will Bowers, MSc, Science and Technology Specialist at Dotmatics, graduated from Imperial College London in 2018 with an MSc in Bioinformatics and Theoretical Systems Biology before commencing a PhD at the Institute of Cancer Research, defining biomarkers of tyrosine kinase inhibitor responses in soft tissue sarcoma. Will joined Dotmatics in early 2020 as a Quality Assurance Analyst, quickly progressing into the role of Data Scientist and, most recently, Science and Technology Specialist with a focus on deep learning methodologies for molecular prediction and generation. Will can be reached at