Spanish Team Builds Neural Network to Predict Small Molecule Characteristics

August 11, 2021

By Allison Proffitt

August 11, 2021 | A team of researchers in Barcelona has gathered bioactivity information for a million molecules using deep machine-learning computational models and a database of experimental results. Both the experimental results and the machine learning tool are available to the community.

The work originated with the Structural Bioinformatics and Network Biology laboratory at the Institute for Research in Biomedicine (IRB) in Barcelona, Spain. In May 2020, the team published in Nature Biotechnology an integration of the major chemogenomics and drug databases including ChEMBL and DrugBank (DOI: 10.1038/s41587-020-0502-7). The result is Chemical Checker (CC), a database that includes processed, harmonized, and integrated bioactivity data on more than 800,000 small molecules.

Data are included from 25 spaces of bioactivity grouped into five categories: chemistry, targets, networks, cells, and clinics. These parameters include the chemical structure of the molecule, the targets with which it interacts, and the changes it induces at the clinical or cellular level. Chemical Checker follows a “chemistry-to-clinics rationale that facilitates the selection of relevant signature classes at each step of the drug discovery pipeline,” the authors wrote.

The 5 × 5 organization was designed to capture distinct aspects of the chemistry and biology of compounds, and a systematic assessment of the original (experimental) resource revealed partial correlations between the 25 data types.

While CC is very useful for well-described molecules, many molecules remain poorly characterized. Chemistry signatures are widely available, but cell-based assays cover only about 30,000 molecules, and clinical signatures are known for only a few thousand drugs.

The researchers sought to infer bioactivity signatures based on their observation that different bioactivity spaces are not completely independent, “and thus similarities of a given bioactivity type (e.g., targets) can be transferred to other data kinds (e.g., therapeutic indications).” This work was published last month in Nature Communications (DOI: 10.1038/s41467-021-24150-4).

Filling in Holes

The team used a Siamese Neural Network (SNN) to integrate the experimental information available and then predict the activity profiles of molecules.

“We feed the SNN with triplets of molecules (an anchor molecule, one that is similar to the anchor (positive) and one that is not (negative)), and we ask the SNN to correctly classify this pattern with a distance measurement performed in the embedding space,” the authors write. They trained 25 SNNs, one for each of the bioactivity spaces in the Chemical Checker, and called each trained network a “signaturizer”.
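The triplet idea in that quote can be illustrated with a minimal Python sketch. It is not the authors' implementation: the article does not state the loss function, the distance metric, or the embedding size, so the Euclidean distance, the margin of 1.0, the function names, and the toy two-dimensional vectors below are all assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor: Vector, positive: Vector, negative: Vector,
                        margin: float = 1.0) -> float:
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the negative sits at least `margin` farther
    from the anchor than the positive does.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def triplet_is_correct(anchor: Vector, positive: Vector, negative: Vector) -> bool:
    """True when the positive is closer to the anchor than the negative."""
    return euclidean(anchor, positive) < euclidean(anchor, negative)


# Toy embeddings with made-up values; real signatures are much longer vectors.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss = triplet_margin_loss(anchor, positive, negative)
ok = triplet_is_correct(anchor, positive, negative)
```

In a real network the embeddings would come from the trained model rather than being written by hand, and a loss of this kind would be minimized over many triplets so that similar molecules end up close together in the embedding space.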

The bioactivity data predicted by the model have a greater or lesser degree of reliability depending on various factors, including the volume of experimental data available and the characteristics of the molecule.

For instance, predictions for chemistry signatures, while generally unnecessary, were often correct. Target and Network prediction scores were also generally high. “Not surprisingly, the most challenging models were those related to cell-based and clinical data, probably due to the inherent complexity of these data with respect to the number of annotated molecules,” the authors wrote.

To validate the tool, the researchers searched the library of compounds at IRB Barcelona for those that could be good drug candidates to modulate the activity of a cancer-related transcription factor (SNAIL1), whose activity is almost impossible to modulate through the direct binding of drugs (it is considered an 'undruggable' target). Of a first set of 17,000 compounds, deep machine learning models predicted characteristics (in their dynamics, interaction with target cells and proteins, etc.) for 131 that fit the target.

Experiments confirmed the ability of these compounds to degrade SNAIL1. For a high percentage of them, the observed degradation was consistent with what the models had predicted, which validates the system.

"The new tool also allows us to forecast the bioactivity spaces of new molecules, and this is crucial in the drug discovery process as we can select the most suitable candidates and discard those that, for one reason or another, would not work," explained Patrick Aloy, senior author on the paper, in a press release about the work.

The team made the software library freely accessible to the scientific community and plans to update it regularly as more biological activity data become available. With each update of experimental data in the Chemical Checker, the artificial neural networks will also be revised to refine the estimates.

“CC signatures evolve with time as bioactivity measurements accumulate in the databases. We will release updated versions of the signaturizers once a year and, as developers of the CC, we are committed to keeping abreast of the latest phenotypic screening technologies and chemogenomics data sets,” the authors write.

“The ever-growing nature of chemical matter (in contrast to the finite number of genes) demands computational methods to provide a first estimate of the biological properties of compounds. We believe that CC signaturizers can bridge this gap and become a reference tool to scrutinize the expected bioactivity spectrum of compounds,” they continue.