Big Data, Supercomputing Deployed For Disease Risk Prediction Models
By Deborah Borfitz
November 12, 2020 | The Department of Veterans Affairs (VA) and the Department of Energy (DOE) have teamed up to develop disease risk prediction models using deep learning, according to Ravi Madduri, a computer scientist at the DOE’s Argonne National Laboratory and co-lead of the collaboration. At the recent Bio-IT World Conference & Expo Virtual, he presented on two of the three projects undertaken by the Million Veteran Program Computational Health Analytics for Medical Precision to Improve Outcomes Now (MVP-CHAMPION) partnership, which focus on cardiovascular disease (CVD), prostate cancer, and suicide prevention.
The mission of MVP-CHAMPION is “to improve the health and wellbeing of our veterans, and the population as a whole,” Madduri says, by better understanding underlying causes of diseases, hereditary factors and health history using big data, high-performance computing, and data science.
The VA has a massive amount of genetic data from its Million Veteran Program and phenotypic data in its Corporate Data Warehouse (CDW), along with deep informatics and domain knowledge, he adds. The DOE, meanwhile, operates some of the world’s fastest supercomputers and has the skillset to manage data at exabyte scale following FAIR (findable, accessible, interoperable, and reusable) guiding principles. Biomedical data are a more recent addition to its modeling and simulation work, Madduri notes.
Data used by MVP-CHAMPION come from 20 years’ worth of paperless electronic health records (EHR) at the VA on 30 million vets, plus associated germline genetic data from the Million Veteran Program, he says. “It’s one of the world’s largest biobanks, [containing] a treasure trove of information.”
CVD, the biggest killer of vets, was one of the priority areas identified at initial ideation meetings held several years ago, says Madduri. The goals of this project are to characterize phenotypic patterns of disease occurrence, identify genetic signatures of outcomes for different racial and ethnic groups, and integrate phenotypic data and genomic signals to improve prediction of incident CVD outcomes.
The overall plan was to design CVD outcome variables that could be used to develop a predictive model, and to improve the definitions for acute myocardial infarction (AMI) and stroke, Madduri says. Models would be used in a study to predict outcome variables for AMI and stroke as well as death.
Much of the work currently underway involves cleaning up the data and reducing its biases to make it ready for modeling, says Madduri. Many of the baseline calculations have already been done, including American Heart Association risk scores for CVD prediction on the MVP cohort, genome-wide association studies (GWAS) for varicose veins, models to predict AMI in CDW cohorts with 1,000 diagnostic codes, and identification of CVD subtypes for multi-task outcome prediction. Those subtypes include hypertension, coronary artery disease, atrial fibrillation, conduction disorders, heart failure, peripheral artery disease, valvular disease, stroke, carotid stenosis, venous thromboembolism, MI, and abdominal aortic aneurysm.
The initial data cube for large-scale deep learning within the MVP cohort used data on 361,581 vets for training and another 90,398 for validation, he adds. It is now being applied to risk prediction.
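For readers who want to check the proportions, the short Python sketch below works out the split implied by those two cohort sizes. Only the two counts come from the presentation; the function name is illustrative, and nothing here reflects how the MVP-CHAMPION team actually partitioned the data.

```python
# Cohort sizes as reported in the article.
TRAIN_N = 361_581
VALID_N = 90_398


def split_fractions(train_n: int, valid_n: int) -> tuple[float, float]:
    """Return the (training, validation) shares of the combined cohort."""
    total = train_n + valid_n
    return train_n / total, valid_n / total


train_frac, valid_frac = split_fractions(TRAIN_N, VALID_N)
print(f"total={TRAIN_N + VALID_N:,} train={train_frac:.1%} valid={valid_frac:.1%}")
```

The counts add up to 451,979 vets, which works out to roughly an 80/20 split between training and validation.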
Prostate Cancer Model
Prostate cancer is the most common solid organ malignancy in American men and the prognosis has been difficult to pin down, says Madduri. The goal with this second project is to take advantage of EHR data on the diverse population in the MVP cohort to develop a model that can be integrated into a clinical decision support tool. A performance assessment for discrimination, calibration and clinical usefulness will be periodically conducted on the model to “guarantee transparency,” he adds.
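Discrimination and calibration are standard yardsticks for risk models: the first asks whether patients who develop the outcome are ranked above those who do not, the second whether predicted probabilities match observed rates. The Python sketch below illustrates both on synthetic data. It is an assumption-laden example, not the team's evaluation code: the metric choices (AUC, Brier score, calibration-in-the-large), the function names, and the simulated cohort are all hypothetical, and clinical usefulness is not covered.

```python
import random


def auc(y_true, y_prob):
    """Discrimination: chance a random case outranks a random non-case (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted risk and the observed 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_in_the_large(y_true, y_prob):
    """Mean predicted risk minus observed event rate (0 means no overall bias)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Synthetic, hypothetical cohort: outcomes are drawn from the predicted risks,
# so the simulated "model" is well calibrated by construction.
rng = random.Random(0)
probs = [rng.random() for _ in range(500)]
labels = [1 if rng.random() < p else 0 for p in probs]

print("AUC:", round(auc(labels, probs), 3))
print("Brier score:", round(brier_score(labels, probs), 3))
print("Calibration-in-the-large:", round(calibration_in_the_large(labels, probs), 3))
```

Rerunning a check like this on fresh data at regular intervals is one way a periodic assessment could flag a model whose performance has drifted.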
Variables in the prostate cancer risk prediction model include number of positive cores, Gleason (grading) score, race, month and year of diagnosis, and repeated measures prior to biopsy (i.e., PSA, testosterone and other routine labs), Madduri says. The plan is to improve the model by adding genomics data (DNA-repair genes, germline GWAS, and genetic risk scores), text (radiology reports and progress notes) and images (MRI and ultrasound).
Madduri ended his presentation with a plug for the Argonne Biomedical Learning Enclave (ABLE), which securely stores and analyzes data for the “AI-mediated medical revolution.” ABLE is a computational capability that combines big data, artificial intelligence (AI), and high-performance computing. It looks to take advantage of public-private partnerships and to develop a workforce with the computing skills to tackle health problems.