‘Federated Analysis’ Helps Fill Genetic Risk Knowledge Gap On BRCA Genes
By Deborah Borfitz
June 14, 2022 | A significantly heightened risk of breast cancer has two well-known culprits—mutations to the tumor suppressor genes BRCA1 and BRCA2—but more than 40% of the time they are black-box “variants of uncertain significance.” Using a new data-sharing innovation known as federated analysis, however, an international team of researchers has categorized 16 of these uncertain variants as benign or likely benign so their carriers may be able to have their clinical risk of disease managed properly, and avoid the risk of invasive and irrevocable surgeries, according to Melissa Cline, research scientist at the University of California, Santa Cruz (UCSC).
The holdup in classifying the variants is largely due to privacy laws making the patient-level data needed for their interpretation “virtually impossible to share outright,” she says. Federated analysis gets around this problem by “bringing the code to the data,” which remains securely at the home institution.
As recently reported in Cell Genomics (DOI: 10.1016/j.xgen.2022.100109), the approach was used to analyze breast cancer clinical data at Japan’s RIKEN Center for Integrative Medical Sciences. The exercise was designed to allow for the responsible sharing of genomic data for the benefit of human health, a core charter of the international standards-setting organization Global Alliance for Genomics and Health (GA4GH).
Seven years ago, Cline cofounded the BRCA Exchange with Amanda Spurdle, a cancer epidemiologist at QIMR Berghofer Medical Research Institute (near Brisbane, Australia). A driver project of the GA4GH, the BRCA Exchange is endeavoring to identify the variants that cause harm, format the data so everyone can understand it, and share this crucial information with patients and clinicians, Cline explains.
About 80 to 90 genes (depending on the counting method) are recognized risk factors on hereditary cancer panels used by genetic testing labs, says Cline. BRCA1 and BRCA2 are among the most worrisome “high penetrance” cancer susceptibility genes. While the lifetime risk of breast cancer is, on average, one in eight, for people carrying a pathogenic BRCA variant that risk rises closer to two out of three, she notes.
“There are stories of families that carry one of these very high penetrance variants and while there are high rates of cancer in the family overall, there is the grandmother who carries exactly the same variant and lives to the age of 85 and dies of natural causes,” continues Cline, referencing an example described in Jeff Wheelwright’s 2012 book, “The Wandering Gene and the Indian Princess: Race, Religion, and DNA.” The same goes for other high penetrance genes, notably TP53, which is responsible for repairing DNA damage.
“When you have one functional copy and not two, you’re basically repairing DNA damage a lot less efficiently and that damage accumulates, which can lead to cancer,” Cline says. As with BRCA1 and BRCA2, it is extremely rare to have two non-functional copies of TP53 since that can be lethal to a developing embryo. In the case of the BRCA2 gene, two non-functioning copies result in a rare childhood disease called Fanconi anemia that has an average life expectancy of about 20 years.
Bot Visits the Data
Most genetic variants that lead to greater disease risk are rare, and most variant interpretation requires evidence that the variant impacts function—e.g., an in silico prediction score as well as clinical observations of patients who carry the variant, she continues. “When you have a rare variant, there is no single institution that is likely to gather enough clinical observations to really come up with a robust interpretation, and so data sharing would seem like the natural solution.”
However, databases of people’s genomes and medical records are locked away by restrictive data privacy laws such as the EU’s General Data Protection Regulation (GDPR) and infractions come with hefty penalties. “Most patient data in the world sits in a secure environment in some institution that is not at liberty to share it,” Cline says.
Fortunately, the information needed for variant interpretation is not the patient records themselves, she continues, but summaries of a set of observations on patients with the same variant. For the recent collaboration with RIKEN, federated analysis was done on breast cancer clinical data derived from the BioBank Japan repository.
The federated method uses a “pipelines to the data” model allowing data custodians to retain control over data use and access, with no time-consuming data transfers required, says Cline. RIKEN didn’t have the resources to code up the needed analysis software themselves, so a Docker container (a type of bot) was sent from California to visit the data and run a series of tests.
The Docker container packages together software and its dependencies, making it much easier to install and execute on a different computer system, she adds. It generated variant-level summary data on patients who carry the harmful BRCA mutations, and that summary data was no longer personal data.
RIKEN was therefore able to share the information more openly, continues Cline. The data could leave the institution and go to UCSC and expert variant curators in Australia.
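To make the idea of variant-level summary data concrete, here is a minimal sketch in Python of the kind of aggregation a visiting container might run inside the data custodian's environment. It is an illustration only, not the pipeline used by the BRCA Exchange or RIKEN: the field names (`patient_id`, `variant`, `breast_cancer`), the placeholder variant labels, and the function `summarize_by_variant` are all assumptions made for this example.

```python
from collections import defaultdict


def summarize_by_variant(records):
    """Collapse patient-level rows into per-variant counts.

    Hypothetical helper: only aggregate counts are returned, so no
    patient identifier appears in the output.
    """
    counts = defaultdict(lambda: {"carriers": 0, "affected": 0, "unaffected": 0})
    for rec in records:
        row = counts[rec["variant"]]
        row["carriers"] += 1
        if rec["breast_cancer"]:
            row["affected"] += 1
        else:
            row["unaffected"] += 1
    # Convert to a plain dict, sorted by variant label for stable output.
    return {variant: dict(row) for variant, row in sorted(counts.items())}


# Made-up patient-level records that would stay at the home institution.
patients = [
    {"patient_id": "P001", "variant": "variant_A", "breast_cancer": True},
    {"patient_id": "P002", "variant": "variant_A", "breast_cancer": False},
    {"patient_id": "P003", "variant": "variant_A", "breast_cancer": False},
    {"patient_id": "P004", "variant": "variant_B", "breast_cancer": True},
]

# Only this summary would leave the institution.
summary = summarize_by_variant(patients)
for variant, row in summary.items():
    print(variant, row)
```

In this sketch the individual records never leave the function's caller; what comes out is a count of carriers per variant, split by whether they were affected, which is the sort of information variant curators work from.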
“The GA4GH data standards are really what makes this all possible,” Cline says. She and Spurdle are among those who test-drove the data standards, including development of the GA4GH Variant Representation Specification, as covered in a special issue of Cell Genomics (DOI: 10.1016/j.xgen.2021.100029) last fall.
In 2015, the GA4GH launched a consortium called the BRCA Challenge to pull together a team of experts—including geneticists, cancer epidemiologists, bioethicists, and research clinicians—to directly tackle the data-sharing obstacles for the two genes of clearcut medical importance, says Cline. It was to be an “exemplar for other genes and heritable conditions.”
At the time, three major databases contained information on the BRCA variants that are clinically important and there was very little overlap between them, she says. “Until you agree on what variants you need to talk about, you can’t really get any farther, so the BRCA Exchange was the first work product of the BRCA Challenge, and its goal is to bring together and harmonize BRCA data from different public sources.”
Today, the BRCA Exchange is the world’s largest public source of knowledge on BRCA variants, Cline says. “We have about 68,466 variants in our database, and there were 11,923 when we started in October 2016.”
This gets back to the clinical problem of variants of uncertain significance. Not knowing categorically whether a BRCA variant is harmful, harmless, or somewhere in between, patients and their doctors will often err on the side of caution, leading to the risk of overtreatment, meaning surgical removal of their breasts and ovaries.
The BRCA Exchange model will be extended to other genes, including CDH1, a significant risk factor for heritable gastric cancer susceptibility, says Cline. She is collaborating with the nonprofit No Stomach For Cancer to launch a portal for the high penetrance gene.
On all such projects to date, to ensure that the Docker container used for data sharing follows the rules on regulatory and policy matters, Cline and her colleagues have turned for guidance to Adrian Thorogood, the former GA4GH regulatory & ethics work stream manager who is now a research and development specialist in law and ethics at the University of Luxembourg. He is not only well versed in policy, but can also speak intelligently on technology and genomics, Cline points out.
For the foreseeable future, a consultant of this type will be necessary to verify that the data analysis safeguards patient privacy—a subject of considerable debate, especially when it comes to the GDPR, says Cline.
To help institutions with limited in-house bioinformatics support get up and running quickly with the GA4GH standards and cloud and workflow APIs, the GA4GH has developed “starter kits” for researchers, announces Cline. The kits should enable more institutions to bring their data online and share, via federated analysis, in a well understood fashion.
Only days ago, RIKEN investigator Yukihide Momozawa emailed Cline to say he had more data whose results he would be able to share, she says. “It would have to be another federated analysis because their own postdocs are over-committed and, outside of the postdocs, there are folks who can run Docker containers but cannot necessarily implement algorithms.”
Federated analysis projects won’t be limited to larger institutions with the financial wherewithal to conduct a study, she adds. In time, global-scale data integration initiatives will also be taken on by the Genomics in Health Implementation Forum supporting uptake of the GA4GH standards.
“Countries that have populations that are underrepresented in genomic databases could leverage the GA4GH technologies to get more knowledge out there on their populations,” says Cline, citing GEnome Medical alliance Japan (GEM Japan), another driver project of GA4GH. Established by the Japan Agency for Medical Research and Development, GEM Japan stands to benefit down the road by having Japanese genetics better understood medically.
“It is said that the human genetic diversity within Africa is greater than the human genetic diversity in the rest of the world,” she additionally notes. H3Africa, yet another GA4GH driver project, is a sequencing initiative endeavoring to capture that diversity.
Principal investigators on the project say that some known disease-causing genetic variants are being seen at such a high frequency that they seem unlikely to be causing disease in Africans, at least not at the same rate as in Europeans, says Cline. The prevailing hypothesis is that something in the genome of these populations is conferring a protective effect.
Although federated analysis remains uncommon, its future looks bright overall, Cline says. The Treehouse Childhood Cancer Initiative, a sister project at UCSC, is using federated analysis on RNA sequencing data from the tumors of pediatric cancer patients. Their gene expression profiles are compared with data already in a compendium to look for similarities and opportunities to repurpose existing drugs.
Since the RNA-seq data itself cannot be shared, she says, hospitals are using Docker containers to generate the gene expression levels for Treehouse. When the groundbreaking project launched a few years ago, she notes, a collaborator in Canada needed the approval of the Canadian Ministry of Finance to participate.
Federation is poised to help fill the genetic risk knowledge gap for multiple other health issues beyond cancer, including heart conditions and infectious diseases, she says. The approach is being used to build large networks for the Canadian CanDIG as well as African, Canadian, and European CINECA projects.