Data Diversity In Precision Medicine

June 26, 2017

By Paul Nicolaus

When Heidi Rehm discovers a genetic variant in a patient, one of the first steps she takes toward figuring out the underlying cause of the illness is to look up how common that variant is. If 10% of the general population carries a particular variant, then it cannot cause a disease found in only 1 in 200,000 people, explained the Harvard Medical School associate professor of pathology and director of the Laboratory for Molecular Medicine at Partners HealthCare Personalized Medicine.
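The logic Rehm describes — a variant too common in the population to account for a rare disease — can be sketched as a simple plausibility check. This is an illustrative sketch only; the function name and the simplifying assumptions (full penetrance, a dominant model) are not part of any actual pipeline.

```python
def variant_plausible(allele_freq, disease_prevalence):
    """Rough first-pass filter: under a fully penetrant dominant model,
    a causal variant cannot be more common than the disease itself.
    Illustrative only -- real pipelines also weigh inheritance mode,
    penetrance, and allelic heterogeneity."""
    return allele_freq <= disease_prevalence

# A variant carried by 10% of the population cannot explain
# a disease affecting 1 in 200,000 people.
print(variant_plausible(0.10, 1 / 200_000))   # False: rule it out
print(variant_plausible(1e-6, 1 / 200_000))   # True: still a candidate
```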

Access to databases supplying information on the rarity, or allelic frequency, of variants is a relatively new phenomenon. Researchers used to be limited to any cases that had previously been run within their own lab, and in most of those instances there was no control population. "There's been a gradual development of resources," Rehm said.

While dbSNP, a database of single nucleotide polymorphisms, was published back in 2001, a variety of other options have since emerged. The 1000 Genomes Project data was released in phases, and the NHLBI GO Exome Sequencing Project (ESP) upped the data ante sixfold. The Exome Aggregation Consortium (ExAC), released in 2014, scaled that up from roughly 6,000 exomes to more than 60,000. And Kaviar, available for querying and downloading since 2010, integrates publicly available datasets, including a downloadable version of ExAC.

Perhaps just as important as the growth in the sheer volume of information, however, is the expansion of the diversity of that data. Because variant frequency can vary by ethnicity or geographic region, these databases are most useful when they include an array of ethnic groups. The lack of a large, diverse collection of data for use as a reference set has been a roadblock for genome interpretation and precision medicine, but there have been improvements. Whereas African Americans were the only non-Caucasians included in ESP, 1000 Genomes incorporated an array of small ethnic populations, followed by ExAC's spike in both numbers and diversity.

Great Expectations

Even more recently, the Allele Frequency Community (AFC) was formed as yet another means of sharing data on genetic variants. Gustavo Glusman, principal scientist at the Institute for Systems Biology and creator of Kaviar, said his involvement in the AFC first began when he discussed the concept with former Ingenuity Systems employee Nathan Pearson at the Advances in Genome Biology and Technology (AGBT) meeting held in 2013.

As Bio-IT World previously reported, when the idea came to fruition in 2015, a mix of 13 academic centers and commercial entities contributed data from over 70,000 human exomes and whole genomes to the project, with more than 100 countries represented, to create an ethnically diverse community database of allele frequencies.

People all over the world have been loading up their data and using Ingenuity Variant Analysis (which joined QIAGEN's portfolio in 2013 following the acquisition of Ingenuity Systems), and all this information has been building up over the last several years, explained Rehm, who was involved in the AFC early on, helping to launch the community and promote its benefits. "So QIAGEN decided to ask people if they were simply willing to let that data be used toward the calculation of allele frequencies," she said.

By contributing data to the AFC, participants gain the aggregated and ancestry-specific allele frequency estimates for their variant assessment purposes, noted Dan Richards, QIAGEN’s VP of Informatics. QIAGEN computes and maintains the data, handling aspects such as duplicate samples, family relationships, quality control, somatic versus germline distinction, known pathogenic variants, deidentification, and ancestry inference.

Unlike large, clinically focused public variant databases like ClinVar, the AFC does not manually interpret the clinical significance of variants. The goal, rather, is to offer members information on the incidence of variants across the general population that can be used as a first-pass filter when determining whether an unknown variant could be responsible for a hereditary disorder.

The advantage, Glusman said, is that it makes the process of sharing data very simple. Plenty of researchers use the QIAGEN platform, and for those who already have their information in the system, it is as easy as marking a checkbox to make their data available for computing allele frequencies.

Some Limitations

Several potential drawbacks have become apparent since the AFC's inception, however. One of the main issues is that the AFC data were sequenced on different platforms. “You can’t get good quality metrics because there’s such diversity of underlying platforms, and I don’t think there’s been good capture of phenotypes for those datasets,” Rehm said, which can be problematic. "So that variant is present in 50 people, but are they 50 people with the disease my patient has or are they 50 healthy? And if you don't know that, it's hard to make use of that."

“I can imagine that there might have been other researchers for whom it was the right thing at the right time,” Glusman said. The system works for those who are interested in specific variants found in the genomes they have produced, but it is not possible to perform a global analysis of allele frequencies in the entire genome, for example. “The concept was great if it worked, but I couldn’t say that I really ever benefitted from it much.”

Whereas the 1000 Genomes Project, ESP, ExAC, and Kaviar make information freely available and clearly indicate how much data is incorporated, that is not the case with the AFC. “To use the Ingenuity system, one has to pay to have the samples active. It is a yearly payment, and at some point we as an institution stopped paying for access to our samples there,” he said. “Our data are in the system but they are not active.”

Functionality in the form of an “AFC Export” button has since been added for “inactive” genomes. The export is a ~300 MB file per genome, which takes time to download and must be handled genome by genome, Glusman noted, so while the option exists, it is not necessarily the most convenient method.

Dataset Overlap

Glusman and his team continue to develop the Kaviar database and its methods for computing allele frequencies. In addition to compiling publicly available datasets, Kaviar includes datasets produced in-house at the Institute for Systems Biology along with a large dataset received from collaborators at the Inova Translational Medicine Institute. The Inova data were incorporated in a way that lets researchers benefit from seeing the allele frequencies without gaining access to the full genomes, he explained, so the private dataset itself is never released.

Despite her initial involvement and backing of the AFC concept, Rehm said she does not make use of the platform and is now devoting her attention to fully open resources. As the medical director of the Broad Institute Clinical Research Sequencing Platform, she works closely with Daniel MacArthur and his team at the Broad to report bugs and provide feedback on a recent iteration of ExAC called the Genome Aggregation Database (gnomAD). Released in the fall as a beta version, it provides the same utility as its predecessor while encompassing a much larger dataset: over 120,000 exomes and more than 15,000 genomes.

"What makes it such a high-quality resource was all of this data was run at the Broad on a high-quality platform and then the variant calling was processed using the same algorithms, so the genotype calls are actually pretty darn accurate," Rehm said. "If you think about how much the ExAC and gnomAD datasets are being used and the number of page hits every day, every minute, the impact has been profound. I would argue it's the single most useful resource I have ever had in the 25 years in this field."

Like their database counterparts, though, ExAC and gnomAD come with their own limitations and caveats. While it is possible to access allele frequency data, it is not possible to gather detailed information about the individuals behind the counts. If a researcher interested in a particular variant suspected of causing disease finds it in three people in gnomAD, she may want to learn more about those three individuals, but consent limitations generally prevent that.

In addition, one of the challenges to contend with as more resources emerge is the potential overlap in datasets. If someone spots five people with a particular variant in ExAC and three in the ESP database, there's a bit of head scratching that follows. The ExAC dataset includes some of the data from ESP but not all of it. "I don't actually know if the total number of patients with this variant are five plus three or if the three is part of the five and it's actually only five," Rehm explained, "so that's where some of these datasets that are actually overlapping start to cause problems." 
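The ambiguity Rehm describes can be expressed as a range: without knowing which samples overlap, the true number of distinct carriers is only bounded, not known. The function name and numbers below are illustrative.

```python
def total_carrier_bounds(exac_count, esp_count):
    """Bound the number of distinct carriers seen across two datasets
    that may overlap. If the smaller set of carriers is entirely
    contained in the larger, the total equals the larger count; if the
    two are disjoint, the counts add."""
    lower = max(exac_count, esp_count)
    upper = exac_count + esp_count
    return lower, upper

# Rehm's example: five carriers in ExAC, three in ESP.
# ExAC includes some ESP data, so the true total is somewhere
# between five (full overlap) and eight (no overlap).
print(total_carrier_bounds(5, 3))  # (5, 8)
```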

"Let's say somebody's data was used in ExAC or gnomAD, but then they also are a client customer of QIAGEN and they load up that same data and then analyze their data in the QIAGEN tool and agree to let that data be a part of the Allele Frequency Community," she added. "Now that data [are] in three places." While researchers generally say the more data the better, this is a legitimate challenge to contend with as resources continue to emerge over time.

Lighting the Way

Rehm pointed to the Global Alliance for Genomics and Health (GA4GH), an international coalition formed to enable genomic and clinical data sharing, as one group working to tackle the various challenges surrounding allele frequencies in order to better harness their potential. Sharing tends to come with technical and ethical issues, but doing so on a global scale ramps up the complexity another notch, as many countries have laws preventing data from being moved out of the country and stored elsewhere.

A federated model is one possibility that is currently being explored. The idea is that many different databases can be electronically connected in real time through a federated platform, allowing users to go on, ask a question, and get a single answer that draws information from all the databases at the same time.
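The federated idea can be sketched in a few lines: each site keeps its data locally and returns only aggregate counts, which the client combines into a single answer. Everything here — the function, the interface, the mock sites — is invented for illustration; it is not how any GA4GH platform is actually implemented.

```python
def federated_allele_frequency(variant, sites):
    """`sites` is a list of callables, each standing in for a remote
    database that returns (allele_count, total_alleles) for a variant.
    Raw genomes never leave the sites; only aggregates are combined."""
    allele_count = 0
    total_alleles = 0
    for query_site in sites:
        ac, an = query_site(variant)
        allele_count += ac
        total_alleles += an
    return allele_count / total_alleles if total_alleles else None

# Two mock sites standing in for remote databases.
site_a = lambda v: (3, 10_000)
site_b = lambda v: (1, 50_000)
print(federated_allele_frequency("chr1:12345:A>G", [site_a, site_b]))
```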

Rehm also mentioned the Beacon project, launched by the GA4GH, which is intended to test the willingness of international sites to share genetic data in the simplest of all technical contexts. If a researcher wants to find out if anyone else in the world has a certain variant, step one is to obtain a simple answer. "All it says is you ask me if my dataset has this variant and I will tell you yes or no. You're sort of putting up a flag to say, yep, I've got that variant in my dataset," Rehm said.
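The Beacon behavior Rehm describes — answer yes or no, and nothing else — can be sketched minimally. The class, dataset, and variant encoding below are made up for illustration and do not reflect the actual Beacon API.

```python
class Beacon:
    """Toy stand-in for a Beacon site: answers only whether a variant
    exists in its dataset, revealing no frequencies or sample data."""

    def __init__(self, variants):
        self._variants = set(variants)

    def query(self, variant):
        # True/False only -- the minimal possible disclosure.
        return variant in self._variants

beacon = Beacon({"17:41245466:G>A", "7:117559590:ATCT>A"})
print(beacon.query("17:41245466:G>A"))  # True: this beacon is "lit"
print(beacon.query("1:12345:A>G"))      # False
```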

It is a low-risk way to share data without the need to cross any regulatory barriers. If the answer is no for 100 databases and yes for one, there is an opportunity to contact that group, explain the research underway, and ask whether any additional information can be accessed. It is an opportunity to collaborate and share, and according to Rehm, "there are hundreds of beacons that have been lit, so to speak, across datasets around the world."

Paul Nicolaus is a freelance writer specializing in health and medicine. Learn more at