An AI Tool for Lung Cancer Diagnosis in Indian Populations, Settings

August 3, 2023

By Allison Proffitt 

The prognosis of lung carcinoma has changed since the discovery of molecular targets and their specific drugs. Somatic EGFR mutations have been reported in lung carcinoma, and these mutant proteins act as substrates for targeted therapies. However, in a resource-constrained country like India, panel-based next-generation sequencing cannot be made available to the population at large.

At the recent Bio-IT World Conference & Expo, Kamal Rawal, Professor and Head of the Center for Computational Biology and Bioinformatics at Amity University in Noida, India, outlined efforts to create a web-based tool that can detect lung cancer from CT images and, using machine learning methods, predict cancer progression and stage as well as EGFR mutation status.

Most lung cancer prediction tools focus on lung nodule prediction, but in India, Rawal said, “we can’t have panel-based next-generation sequencing available to large populations for screening purposes.” Yet there’s still a need for low-cost and non-invasive diagnostic options.  

He and his team surveyed the current lung cancer nodule prediction tools, including LungCAD, LCP-CNN, LUNA16, DeepLung, and more. He found that most suffered from high false positive rates, some were unable to detect nodules in peripheral and apical regions, and some had difficulty detecting small nodules.

Rawal proposed a solution that would be interpretable by clinicians and approved for use by pulmonologists in low-resourced areas. He prioritized reproducibility, a web-based user interface, and an extensive dataset featuring heterogeneous Indian and White patient samples.

Dataset Diversity 

Having a strong representation of Indian samples was important to Rawal because previous studies had featured more White and Chinese patient populations. So he built five sample cohorts. Three cohorts comprised 2,066 Indian patient samples gathered from the Rajiv Gandhi Cancer Institute & Research Centre in New Delhi. One cohort comprised 211 White patient samples from The Cancer Imaging Archive. The training dataset comprised 1,010 patients from the Lung Image Database Consortium. The team acquired 315,000 images from various open sources; 129,000 images from Indian cancer patients were unique to this effort.

Using these various data, the team began to build AI-based Predictive System-Nodule, or AIPS-N. Starting with annotated CT lung images, they trained AIPS-N to automate lung segmentation and nodule property prediction. They added the CT images from the four patient cohorts and generated the AIPS-N model and scores. Finally, they added the clinical factors for the four patient cohorts (including age, gender, smoking status, and histology) and trained the AI-based Predictive System-Mutant model, or AIPS-M, to predict EGFR mutational status.

The team did not build all of their machine learning approaches from scratch, instead borrowing existing tools. “We tried out different deep learning systems,” Rawal said. For example, to extract regions of interest from lung CT images for the AIPS-N model, the team settled on Facebook Research’s Detectron2 Faster R-CNN R101-FPN116. The AIPS-N model scores images on nodule sphericity, margin, texture, malignancy, and spiculation, and AIPS-M predicts EGFR mutation status.
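As a rough illustration of how image-derived nodule scores and clinical factors might be combined into a single input for a mutation-status classifier, here is a minimal Python sketch. It is not the team's implementation: the `PatientRecord` class, the category lists, the age scaling, and the encoding choices are all hypothetical, and only the five nodule properties and four clinical factors come from the talk.

```python
from dataclasses import dataclass

# Nodule properties named in the talk; the numeric score scale is an assumption.
NODULE_KEYS = ("sphericity", "margin", "texture", "malignancy", "spiculation")

# Hypothetical category lists; the talk did not enumerate them.
SMOKING = ("never", "former", "current")
HISTOLOGY = ("adenocarcinoma", "squamous", "other")


@dataclass
class PatientRecord:
    nodule_scores: dict   # property name -> score from an image model
    age: int
    gender: str           # "female" or "male" in this sketch
    smoking_status: str   # one of SMOKING
    histology: str        # one of HISTOLOGY


def one_hot(value, categories):
    """Encode a categorical value as a list of 0.0/1.0 flags."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def to_feature_vector(rec):
    """Concatenate nodule scores and encoded clinical factors."""
    features = [float(rec.nodule_scores[k]) for k in NODULE_KEYS]
    features.append(rec.age / 100.0)  # crude scaling, assumed for illustration
    features.append(1.0 if rec.gender == "female" else 0.0)
    features += one_hot(rec.smoking_status, SMOKING)
    features += one_hot(rec.histology, HISTOLOGY)
    return features
```

A vector like this could then be passed to any tabular classifier. The fixed key order keeps the features aligned across patients.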

“Once we built the system we started evaluating, we started testing, we started validation on independent datasets,” Rawal said. The team compared the performance (accuracy, precision, recall, etc.) of various machine learning algorithms and hyperparameter search methods (GridSearchCV, RandomizedSearchCV, Decision Tree Classifier, etc.) on separate cohorts of patient samples, testing across the four cohorts of Indian and White patient samples. The model is now scoring highly (about 85% on precision, 88% on accuracy, 98% on recall), performing “very close to how clinicians are thinking, how radiologists are thinking, how humans are thinking,” Rawal said.
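For readers less familiar with these metrics, the sketch below shows how precision, recall, and accuracy are conventionally computed from the four cells of a binary confusion matrix. The function name and the counts in the usage example are made up for illustration; they are not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, and accuracy from confusion-matrix counts.

    tp/fp: true/false positives; tn/fn: true/false negatives.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    # Precision: of everything flagged positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all true positives in the data, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / total
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Illustrative counts only, not results from the talk.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

High recall with lower precision, the pattern reported in the talk, means a model misses few true cases at the cost of some false alarms.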

The team has built a prototype user interface for academic use. They are also working on building a large language model (LLM) that generates automatic medical reports for doctors or patients. “For everything, we are automating it, so this will be a good value add for any hospital, radiologist, oncologist,” Rawal explained.

Next Steps 

The work is just beginning, Rawal explained. Currently the AIPS-M model is trained only on EGFR mutation data. He’d like to add genomic data on other genes including ALK, KRAS, and ROS1. A method of simultaneously analyzing numerous target genes will pave the way toward saving valuable resources in resource-limited settings, he said. He’s also looking forward to including even more Indian population data to further improve performance.