Small Data Finding Could Help Big Data Quality

By Sean Ekins

May 15, 2013 | Guest Commentary | I get to have the most fun when someone wants to collaborate on a crazy idea. But crazy ideas come with unexpected challenges, too.

Last year I embarked on one such project. It stemmed from an editorial and blog posts on structure errors and database quality that I wrote with Antony J. Williams from the Royal Society of Chemistry. Joe Olechno from Labcyte saw our work and wondered whether other factors could also be affecting the data in databases.

After some emails and a meeting around the Tri-Conference* last year, we proposed a few computational analyses to look at the effect of dispensing compounds with tip-based or acoustic approaches, using only data from some AstraZeneca patents on a kinase project. Within a short time we had completed and written up the study, showing that tip-based dispensing led to different computational models, and different conclusions about the target, than acoustic dispensing.

The journal review process, though, took longer than the analyses, which was surprising as we felt that our observation was of general interest.

Dispensing compounds is something many take for granted. Think of all the types of data and samples that are handled in pharma, biotech, hospitals, and CROs. Yet dispensing technique can make a significant difference in results.

While big pharma may be aware of such issues, as evidenced by their shift to different dispensing tools, we would argue that academia is years behind. Just consider all the academic screening centers that are now taking up the slack as pharma pushes much of the earlier high throughput screening out the door and into academia’s lap.

The average scientist is naïve about the effect of dispensing technique on their small datasets. The average computational modeler who relies on PubChem, ChEMBL, or some other public or commercial database for their big data rarely considers how the compounds behind that data were dispensed.

Big data is the topic of much conversation right now, but the data quality angle has been totally ignored. Yet it could be critical for any insights we try to gather from our datasets.

Our paper was published in PLOS ONE a week ago, and has since been picked up by bloggers like Derek Lowe at “In the Pipeline” and featured in the press. We would not have predicted the polarizing effect it has had.

Our inboxes have been full of opinions that have veered from dismissal to amazement.

“… It scared the cr*p out of her for some reason…”

“While it's an interesting issue, I have to say, the paper is rather weak.”

“Not surprised... I did never do this specific comparison..”

“In my entire career I only succeeded once in getting such a 'crazy outcome' published.”

“..is something that people should be aware of when using data.”

I am under no illusions: there are many limitations to the paper, and as many questions still to be answered as we tried to address. But this topic might otherwise have stayed tucked away at high throughput screening conferences, or as proprietary knowledge in a few pharmas.

Our finding with a small dataset generated by a large pharma could ultimately improve big data quality. The work shows the value of collaboration and persistence and what it can do for the scientific community, and certainly points to future fundamental science that still needs to be done.

Sean Ekins is Senior Consultant for Collaborations in Chemistry and VP Science at Collaborative Drug Discovery Inc. He can be reached at ekinssean@yahoo.com.

* Molecular Medicine TriConference, Cambridge Healthtech Institute, San Francisco. www.triconference.com