Broad and Novartis Keep Up with Big Data

By Matt Luchette

May 15, 2014 | In the era of big data, drug developers are pushing for data analysis to keep up with increased data acquisition. At this April’s Bio-IT World Conference & Expo, Andrea DeSouza, from the Broad Institute, and Anne Mai Wassermann, from Novartis, spoke on new technologies they hope will squeeze meaning from researchers’ mounds of data.

At the Broad Institute, DeSouza, the Director of Informatics, Data Analysis, and Finance, noticed a bottleneck: scientists studying molecular probes, compounds that help characterize biological pathways and spur drug development, were producing data faster than it could be analyzed. These probes were often assayed at multiple centers, and to share their data, researchers had to write scripts to annotate their results, which took time away from data analysis.

DeSouza’s goal, she explained, was to help scientists find “that meaningful data set so you can speed up hypothesis generation.”

To open the bottleneck, DeSouza, with a team from seven other research institutes, developed the BioAssay Research Database (BARD). The project, DeSouza explained, had three aims: to understand the data scientists were producing, clean up the data, and “mask the complexity from the scientists in the lab.” By annotating the data and making it more accessible, DeSouza hopes BARD will improve communication within multicenter projects and speed hypothesis development.

Starting in early 2012, BARD was built as an open source program to help scientists query data from the Molecular Libraries Program, an NIH-funded initiative to accelerate probe discovery. Today, BARD recognizes over 2,000 assay definitions, houses data from over 3,000 experiments and 100 molecular probes, and supports 15 plug-ins for further data analysis.

A key challenge in developing the program, DeSouza explained, was handling the vast diversity of terms scientists use to report their results. While analyzing the PubChem database, for instance, the team found that scientists used 1,800 different phrases just to represent the terms “percent inhibition” and “10 uM.” How could the program annotate data, when everyone reports data differently?

To handle the diversity, BARD controls the vocabulary scientists use for reporting their results. But it hasn’t been easy, says DeSouza. “The harder part was getting [scientists] to engage with [BARD] as they were cleaning up the data.” If scientists were going to use BARD, the program needed to be nimble and quickly adjust to how scientists wanted to use it. “Don’t stand in the [scientists’] way of a new term being added to the system,” DeSouza concluded.
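The idea of a controlled vocabulary can be sketched in a few lines of Python. This is only an illustration, not BARD's actual implementation: the synonym table, the function names `to_controlled_term` and `add_synonym`, and the normalization rule are all assumptions made for the example. It maps free-text result labels onto a single canonical term and lets a new synonym be registered without friction.

```python
import re

# Hypothetical synonym table; the real BARD vocabulary is not given in the article.
CONTROLLED_TERMS = {
    "percent inhibition": ["percent inhibition", "% inhibition", "pct inhibition", "inhibition (%)"],
    "10 uM": ["10 um", "10um", "10 micromolar"],
}

def _normalize(label):
    """Lowercase, trim, and collapse runs of whitespace in a free-text label."""
    return re.sub(r"\s+", " ", label.strip().lower())

# Reverse index: normalized synonym -> canonical term.
_LOOKUP = {
    _normalize(synonym): term
    for term, synonyms in CONTROLLED_TERMS.items()
    for synonym in synonyms
}

def to_controlled_term(label):
    """Return the canonical term for a label, or None if it is not recognized."""
    return _LOOKUP.get(_normalize(label))

def add_synonym(term, label):
    """Register a new way of writing an existing canonical term."""
    _LOOKUP[_normalize(label)] = term

print(to_controlled_term("  % Inhibition "))   # canonical term for a messy label
print(to_controlled_term("inhib. percent"))    # unrecognized labels return None
add_synonym("percent inhibition", "inhib. percent")
print(to_controlled_term("Inhib.  Percent"))   # recognized after registration
```

Returning `None` for unknown labels, rather than guessing, leaves the decision to add a term with the scientist, in keeping with DeSouza's advice not to stand in the way of new terms.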

A Vision for Drug Candidate Analysis

Where BARD aims to improve analysis across multiple institutions, companies like Novartis have been improving data visualization in-house. In a talk shortly before DeSouza’s, Wassermann, a researcher at Novartis, spoke about the company’s suite of three visualization tools: HTS-Explorer, Chemotography, and ConTour. The tools work together to help researchers identify promising drug candidates.

HTS-Explorer, the base of the suite, is aimed at medicinal chemists, said Wassermann. A chemist testing a compound’s activity against a specific protein may be interested in how similar compounds have fared in the past. HTS-Explorer groups Novartis’s compounds into chemical classes and colors the compounds based on how they have performed in previous screens against the same protein family. The program can also be run in Spotfire for further data visualization.

“It’s the best of both worlds,” said Wassermann. “Flexible data visualization from Spotfire, as well as data annotations from HTS-Explorer.”

But the same researcher may also be interested in the chemical similarity of compounds that affect a particular biological pathway. Like Explorer, Chemotography encodes chemical similarity by color, but overlays those colors on the scientist’s target pathway. The tool is meant to illustrate the diversity of compounds that hit certain pathway elements. Chemical classes that affect a particular target more than others could point the development team towards more promising drug candidates.

However, chemical classes aren’t the full picture – chemically similar compounds may have different biological activities. Similarly, some compounds may have a similar mechanism of action, without being in the same chemical class. ConTour clusters compounds based on their biological activity, and like Chemotography’s chemical classes, shows which clusters may selectively affect a specific target.
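To make the distinction concrete, here is a minimal sketch of grouping compounds by what they hit rather than by chemical structure. The article does not describe how ConTour clusters compounds, so everything here is an assumption for illustration: the compound and target names are invented, the similarity measure is Jaccard overlap between sets of targets, and the grouping is a simple greedy pass with a hypothetical threshold of 0.5.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_by_activity(profiles, threshold=0.5):
    """Greedy clustering: each compound joins the first cluster whose
    representative profile is at least `threshold` similar, else starts a new one."""
    clusters = []  # list of (representative target set, member names)
    for name, targets in profiles.items():
        for representative, members in clusters:
            if jaccard(representative, targets) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((targets, [name]))
    return [members for _, members in clusters]

# Invented activity profiles: the set of targets each compound was active against.
profiles = {
    "cmpd_A": {"kinase_1", "kinase_2"},
    "cmpd_B": {"kinase_1", "kinase_2", "kinase_3"},
    "cmpd_C": {"gpcr_1"},
}

print(cluster_by_activity(profiles))
```

Two compounds from different chemical classes would land in the same group here if their target sets overlap enough, which is the kind of relationship a purely structural view would miss.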

Hit assessment isn’t just about picking the most potent compound from a screen, Wassermann explained. Assays only represent a part of the biological picture. Novartis’s visualization tools are designed to accelerate drug development by helping researchers understand how candidate compounds work in their biological context.

Both BARD and Novartis’s suite are built to be seamless tools that help scientists’ data analysis keep pace with data production. But in building BARD, DeSouza recognized how challenging it is for scientists to communicate their data outside of their own lab notebooks. Much of BARD’s initial development depended on student interns who entered data for the scientists. DeSouza hopes the project helped the students understand the importance of high-quality, well-annotated, and easily communicated data in their own careers. “Without the help of the students,” she said, “I’m not sure we would be where we are today.”