Mutational Signatures In Cancer Improved By Supervised Machine Learning

April 14, 2021

By Deborah Borfitz

In a perfect, patient-centric world, anyone diagnosed with cancer would know the cause (including the DNA-damaging environmental exposures, health habits, and inherited factors) based on the unique mutational pattern of their tumor, says Cristian Tomasetti, Ph.D., associate professor of oncology at Johns Hopkins Medicine with a joint appointment in biostatistics in the university’s school of public health. This will be an important job for machine learning in the future, even if “the answer will sometimes be the normal process of aging, what I call bad luck.”

Tomasetti and his colleagues at the Johns Hopkins Kimmel Cancer Center are off to a good start with their newly developed supervised machine learning algorithm that does the best job yet of connecting tissue-specific cancers to the potential disease-driving factors, he says. Details appear in a study recently published in eLife (DOI: 10.7554/eLife.61082).

The algorithm found that carcinogens like smoking and alcohol have their own, tissue-dependent mutational signature, Tomasetti says. Smoking, for example, has different signatures in bladder, head and neck, and lung cancers. Even the aging signature looks “very different” site to site. This is a significant departure from the previous methodology, he notes, where a key theoretical assumption was that a mutational process would have the same signature across all tissues.

The novel supervised approach also found that in several tissues—including the kidney, bladder, cervix, and esophagus—the smoking signature looks more like the aging signature of the same tissue rather than the smoking signatures in other tissues, suggesting that inflammation is the main effect of several carcinogens, he says.

Obesity is also back on the radar as a mutational agent leaving behind its own unique mark, says Tomasetti, adding that sequencing studies over the past few years have generally found no more mutations in the cancers of obese than nonobese patients. Using the new algorithm, mutational signatures of obesity were seen in both kidney and uterine cancers.

All this could upend findings of genome-wide sequencing studies published in recent years, since the current standard of analysis performs substantially worse than the supervised machine learning approach, says Tomasetti. In a head-to-head comparison with a widely used unsupervised machine learning algorithm (non-negative matrix factorization, or NMF), the predictive accuracy of the new algorithm was much higher. Specifically, the median area under the curve (0 to 1 scale) for the supervised algorithm was 0.73 for aging and 0.90 for all other factors, besting the 0.57 (on par with random chance) and 0.77, respectively, for NMF, he says.
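For readers unfamiliar with the metric, the area under the curve can be read as the probability that a randomly chosen exposed patient receives a higher signature score than a randomly chosen unexposed one, so 0.5 corresponds to chance and 1.0 to perfect separation. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are illustrative assumptions, not data or code from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted signature scores (higher = more likely exposed)
    labels: 1 for exposed, 0 for unexposed (hypothetical annotations)
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one exposed and one unexposed sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # exposed sample ranked above unexposed sample
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores for six patients, three exposed and three unexposed.
toy_scores = [0.9, 0.8, 0.3, 0.35, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels))
```

A score that carries no information about exposure (for example, the same value for every patient) yields 0.5 under this definition, which is why a result near 0.57 is described as close to chance.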

Mutational signatures identified by the new methodology have been termed SuperSigs. They were obtained by analyzing sequencing data from 30 types of cancers recorded in The Cancer Genome Atlas and using available annotation on clinical variables such as age, smoking status, and body mass index. The research team did not pre-specify that mutational features had to span the same number of base pairs, as is done with NMF, but for simplicity’s sake restricted their analysis to features no more than three base pairs in length.

SuperSigs were built on a subset of features with relatively low variance across the patient population, to improve their predictive power, says Tomasetti. Contrary to NMF, signatures were not assumed to be the same across all tissues.

Differing Approaches

The main mutational patterns of some factors like aging (e.g., C>T mutations) and smoking (e.g., C>A mutations) have been known for years. The NMF methodology, first published in 2013 (DOI: 10.1038/nature12477), made it possible to “look across the spectrum at all possible mutational changes” (the distribution) instead of just looking at only the main “peaks” (the most common mutations), says Tomasetti.

NMF provides the distribution over each of 96 possible combinations of trinucleotides associated with a mutation, he explains.
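The figure of 96 comes from simple counting: six substitution classes (conventionally reported from the pyrimidine base, C or T), each flanked by one of four possible bases on either side, gives 6 × 4 × 4 = 96 trinucleotide contexts. A minimal sketch of that enumeration follows. The label format (such as A[C>T]G) and the function names are assumptions chosen for illustration.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine (C or T) base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def trinucleotide_channels():
    """Return all 96 labels of the form 5'-base[substitution]3'-base."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


def spectrum(observed):
    """Normalize a list of observed channel labels into a distribution
    over all 96 channels (channels never seen get probability 0)."""
    counts = {channel: 0 for channel in trinucleotide_channels()}
    for label in observed:
        counts[label] += 1
    total = sum(counts.values())
    return {ch: (n / total if total else 0.0) for ch, n in counts.items()}


channels = trinucleotide_channels()
print(len(channels))

# Hypothetical tumor with four mutations, three of them C>T.
toy = spectrum(["A[C>T]G", "A[C>T]G", "T[C>T]A", "C[C>A]A"])
print(toy["A[C>T]G"])
```

A signature in this framing is one such distribution over the 96 channels, rather than just its tallest peak.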

Tomasetti says he was unconvinced by the findings of the unsupervised algorithm, which extracted more than 50 distinct mutational signatures from an analysis of nearly five million mutations seen in about 7,000 cancers. For one thing, it is now well known that a sizeable number of mutations happen normally in cells independent of a carcinogen (approximately three to six mutations every time the DNA of a cell duplicates) and yet only one or two of the 50 signatures were associated with the normal process of aging, having a relatively modest contribution, he says.

In other cases, mutational signatures produced by NMF were inexplicably found too often in one cancer type, Tomasetti continues. BRCA mutations, for example, were found in practically all women with breast cancer although they are known to occur in only a subset of patients. Similarly, the smoking signature was found among all liver cancer patients when common sense suggests not all of them were either smokers or subjected to secondhand smoke.

The factors behind mutational signatures of cancer, be they BRCA mutations, smoking or alcohol use, UV exposure, or the normal aging process, make damaging changes to DNA through different processes, he says. “If we know that someone is a smoker and someone else is not a smoker, why not use that [kind of] information when training an algorithm to recognize this mutational signature?”

The first step for the research team was to test the existing unsupervised machine learning approach against a completely random technique to see how they compared in detecting mutational patterns that went beyond the peak for C>A mutations caused by smoking or the peak for C>T mutations caused by aging, says Tomasetti. Their performance was identical.

“That told me measured improvement was needed in this methodology, and this is important because … [it] is what is being used today for essentially all [published] genomics analysis” based on cancer sequencing data, he says.

In developing their supervised technique for the genetic analyses of cancer, the research team labeled patients as young or old and supplied their known exposures and risk factors when training the algorithm, Tomasetti continues. The standard unsupervised methodology, in contrast, notes patterns and then goes back to correlate them with exposures.
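The practical difference is that a supervised method sees the exposure labels while it learns. The toy sketch below is not the SuperSigs algorithm. It only illustrates the idea of using a known label (here a hypothetical smoker flag) to find which mutation types differ most between exposed and unexposed patients. All names and numbers are invented for illustration.

```python
from statistics import mean


def label_guided_features(samples, exposed, top_k=2):
    """Rank mutation types by how much their mean fraction differs
    between exposed and unexposed patients.

    samples: list of dicts mapping mutation type -> fraction of mutations
    exposed: list of booleans, one known exposure label per sample
    """
    features = sorted(samples[0])
    diffs = {}
    for feature in features:
        in_exposed = [s[feature] for s, e in zip(samples, exposed) if e]
        in_unexposed = [s[feature] for s, e in zip(samples, exposed) if not e]
        diffs[feature] = mean(in_exposed) - mean(in_unexposed)
    ranked = sorted(features, key=lambda f: abs(diffs[f]), reverse=True)
    return ranked[:top_k], diffs


# Invented mutation fractions for four patients.
toy_samples = [
    {"C>A": 0.50, "C>T": 0.30, "T>C": 0.20},
    {"C>A": 0.45, "C>T": 0.35, "T>C": 0.20},
    {"C>A": 0.15, "C>T": 0.60, "T>C": 0.25},
    {"C>A": 0.20, "C>T": 0.55, "T>C": 0.25},
]
toy_exposed = [True, True, False, False]  # hypothetical smoker labels

top, diffs = label_guided_features(toy_samples, toy_exposed)
print(top)
```

An unsupervised method would instead factor the same table of fractions without the `toy_exposed` column and only afterward check whether any extracted pattern lines up with smoking.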

The new method also allows for the pairing of supervised and unsupervised approaches to filter out the effect of known exposures to carcinogens to explore the effects of potential unknown factors, he adds.

Potential Impact

For now, SuperSigs are common patterns across patients having the same cancer type rather than being patient specific, Tomasetti says. But, as with the predecessor mutational signatures, their presence and amount are specific to individuals.

One long-term goal is to better understand the etiology of cancer, says Tomasetti. Physicians can only get so much clinical information by interviewing patients, including if they smoke and how much. “There are many exposures [to carcinogens] that patients may not even know about. And how can a patient remember the intensity of an exposure over a lifetime?”

A view of the signature of smoking or alcohol use in actual DNA is in any case likely to be impactful in clinical decision-making, he says. In the future, SuperSigs should help enable treatment choices to be personalized based on how similar patients have fared on different therapies.

Early detection may also need to be the focus of greater effort and resources, Tomasetti says. In the recent study, the research team determined that 69% of mutations found in cancer patients across all tumor types can be attributed to randomly occurring mutations that are part of the normal endogenous mutational processes associated with aging.

The entire software package for the new supervised machine learning algorithm will be made available to the wider research community on the open source Bioconductor platform soon, Tomasetti says. A repository for the R package containing the SuperSigs algorithm is already available on GitHub. As he is aware, the new approach must be an easy-to-use tool for everyday researchers. “Otherwise, they’ll revert back to the standard and not-as-good technique just because it is simpler to adopt.”

Tomasetti says he hopes to collaborate with groups using the previous methodology, including developers of the unsupervised machine learning algorithm.