ClinGen and Lessons for the Precision Medicine Initiative

June 10, 2015

By Allison Proffitt  

When Heidi Rehm sat down with the literature, at first glance it looked like she had data from five patients. One of those was her own patient, George*, whose genetic testing had revealed two rare variants. Rehm was researching the variants to try to determine whether either was pathogenic, and a literature review turned up four papers discussing them. Four independent cases with the same rare disease help when a clinical geneticist is trying to determine pathogenicity. Four published reports. Four datasets to compare with her own.

Except, of course, there weren’t five patients. There was only one.

After digging into the author lists and contacting one of the research physicians, Rehm determined that each of the papers was reporting her own patient’s results. George had been involved in a research study and his sample had been exchanged among investigators. Those results were published in four papers. An additional technical paper reused some data for validation of a method. And when poor George didn’t get access to the results of his testing, he came to Rehm.

Rehm figured it out, but it was a fluke. “I can’t always prove that the patient I tested definitely wasn’t the subject of a research article also,” she says. “We do the best we can.”

It’s one of the data problems that plague genetic research now, and it will only get worse as the Precision Medicine Initiative takes shape. How will we keep the data straight? How will we keep the patients straight? How will we make progress?

Heidi Rehm wears a lot of hats. She’s the Director of the Laboratory for Molecular Medicine at Partners Healthcare Personalized Medicine and Associate Professor of Pathology at Harvard Medical School. She’s the principal investigator on one of ClinGen’s three primary grants, sits on the steering committee, and is first author on a New England Journal of Medicine Special Report on ClinGen published last month.

She was also on hand when President Obama announced his vision for the Precision Medicine Initiative in January. Rehm’s experience with ClinGen will make hers an important voice as the infrastructure needs for the initiative are better defined.

Building the Clinical Genome Resource 

The Clinical Genome Resource—ClinGen—is a consortium of three research groups, along with staff at the National Center for Biotechnology Information (NCBI), that received funding from the National Institutes of Health in September 2013 to develop authoritative information on the millions of genomic variants relevant to human disease. ClinGen’s variant repository is ClinVar, NCBI’s variant database, which seeks to centralize assertions about the pathogenicity of these variants from multiple sources and standardize how that information is presented.

The original funding is for four years, and less than two years into the project ClinGen has already established its place as a hub of genetic information and a proving ground for some of the challenges to come as we charge into the era of precision medicine.

As of May 4, 2015 (according to the NEJM report), ClinVar contains 172,055 variant submissions across 22,864 genes from 314 submitters—35 of which have deposited more than 50 genetic variants with medical interpretation into ClinVar. More than 118,000 of the unique variants have clinical interpretations, though 21% of those interpretations are clinical question marks—variants of uncertain significance. Only 11% of the variants with clinical interpretations have been submitted by more than one lab, the first step in arriving at a consensus. For 17% of those, the interpretations do not agree.

For ClinVar, the more interpretations, the better, Rehm says. “We want Mayo Clinic to say what they think about [a variant], and my lab to say, and GeneDx to say and Baylor to say. And hey, if we all agree, that’s great!” But as in the case of George, more opinions in the system don’t always mean more data.

“We all base our assertion that we make about variants on actual data, and a lot of that is case-level data. Therein lies the challenge.”
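The cross-lab comparison ClinVar enables—how many variants have multiple submitters, and how often those submitters disagree—can be sketched as a toy tally. All variant names, lab names, and calls below are hypothetical, not real ClinVar submissions:

```python
from collections import defaultdict

# Toy submissions: (variant, lab, interpretation). Illustrative data only;
# real ClinVar records carry far richer evidence and provenance.
submissions = [
    ("var1", "LabA", "pathogenic"),
    ("var1", "LabB", "pathogenic"),
    ("var2", "LabA", "uncertain significance"),
    ("var2", "LabC", "likely pathogenic"),
    ("var3", "LabB", "benign"),
]

def summarize(subs):
    """Group interpretations by variant and flag multi-lab conflicts."""
    by_variant = defaultdict(dict)
    for variant, lab, call in subs:
        by_variant[variant][lab] = call
    # Variants with more than one submitting lab...
    multi = {v: calls for v, calls in by_variant.items() if len(calls) > 1}
    # ...and the subset where the labs' calls disagree.
    conflicting = {v: calls for v, calls in multi.items()
                   if len(set(calls.values())) > 1}
    return by_variant, multi, conflicting

by_variant, multi, conflicting = summarize(submissions)
print(f"{len(multi)} of {len(by_variant)} variants have >1 submitter; "
      f"{len(conflicting)} of those conflict")
```

In this toy set, two of three variants have multiple submitters and one of those pairs conflicts—the same kind of accounting behind the 11% and 17% figures above.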

The patient problem isn’t one that’s easy to solve in ClinVar, Rehm explains, because it’s a variant-centric database; it’s up to the labs that are classifying the variants to understand the source of their data.

And those sources of data are about to explode. The Precision Medicine Initiative proposes a one million-member cohort, but already hospitals, research groups, and companies all over the world are growing their sequencing capacity and projects.

ClinGen is doing its part to gather some of these data. ClinGen has set up a patient portal called GenomeConnect where those who are interested in sharing their genetic data and health information can register. Patients who have had or are considering having genetic testing can share their results and take surveys to share information about their health. De-identified information will be transferred to ClinVar and other ClinGen resources for advancing genomic knowledge, and participants will receive updates when there are opportunities to connect with other participants who share the same condition, gene or genetic variant. Patients can access GenomeConnect to join, and ClinVar to search for genetic variants. Another portal, ClinGenKB, will be set up to connect researchers and clinicians to learn about the effects of genetics on human health and disease.



ClinVar is the preferred repository for all clinically interpreted variants to emerge from NIH grants, Rehm says, though the database won’t be an exhaustive resource. Rehm envisions it as always connected to other databases.

“For example, the PharmGKB database [Pharmacogenomics Knowledge Base] is a database of information on pharmacogenetic relationships of variants,” Rehm explains. “While we are working with PharmGKB for them to deposit all of their variants into ClinVar with some sort of single-term interpretation, that single-term interpretation does not even remotely capture the depth of knowledge of pharmacogenetic variation and all these studies that support or don’t support it and what level of evidence.”

In this case, ClinVar would serve as an index of the PharmGKB data. “The idea is to put your short, sweet, to-the-point classification to say, yeah, it’s clinically significant or no, it’s not, in ClinVar for centralized download and use in genomic pipelines, but then if you want, link out to the database that has all the extra detailed information,” Rehm says.
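One way a genomic pipeline might locate ClinVar records for that kind of centralized lookup is through NCBI’s public E-utilities. The sketch below only constructs the query URL rather than sending a request; the gene symbol is an arbitrary example, and a real pipeline would parse the returned IDs and follow up with further E-utilities calls:

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint; db=clinvar targets the ClinVar database.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def clinvar_search_url(gene: str, retmax: int = 20) -> str:
    """Build an esearch URL for ClinVar records in a given gene.

    The "[gene]" field qualifier restricts the search term to gene symbols.
    """
    params = {"db": "clinvar", "term": f"{gene}[gene]",
              "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

url = clinvar_search_url("CFTR")
print(url)
```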

The same model is working for the Clinical and Functional CFTR database (CFTR2) that banks data on specific cystic fibrosis mutations. CFTR2 has submitted clinical interpretations for 133 variants to ClinVar.

These are both examples of what Rehm sees as the coming model of data management, and what will be needed as more data is gathered, moving far beyond just genomic variants to incorporate whole genomes and whole exomes, rich phenotype data, and other lifestyle and demographic data. A project of that scope will demand a federated network for the data, Rehm believes.

“The old model was, you define a single centralized place where you aggregate all the data together and… they all get analyzed in a centralized site together. What’s likely to be the prevailing model is more of a federated network.”
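A minimal sketch of the federated idea, with hypothetical site names and records: each site answers a narrow query against its own local data, so raw case-level records never have to be pooled in one place:

```python
# Toy federated lookup. Each "site" keeps its records locally and answers
# only a narrow query; in a real network the lookup would be a remote API
# call rather than a local dictionary access. All names/data are invented.
sites = {
    "hospital_a": {"var1": "pathogenic", "var2": "benign"},
    "lab_b": {"var1": "pathogenic"},
    "registry_c": {"var2": "uncertain significance"},
}

def federated_query(variant, sites):
    """Ask every site for its local call on a variant; aggregate the answers."""
    answers = {}
    for name, records in sites.items():
        call = records.get(variant)  # stand-in for a remote, per-site API call
        if call is not None:
            answers[name] = call
    return answers

print(federated_query("var1", sites))
```

The aggregator sees only each site’s interpretation, not the underlying case data—the essence of the federated model Rehm describes.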

But such a federated model will require strong infrastructure and a consensus on standards and the types of data collected—especially, Rehm says, phenotype data.

The Phenotype Problem 

In the past, phenotype data was collected at one point in time, usually as a single diagnosis: “patient has diabetes.” That is grossly inadequate, Rehm says. She advocates for data staying in its original location, such as an electronic medical record, connected to the network with APIs so that changes in the data are captured over time.
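One sketch of what capturing change over time might look like: two snapshots of the same (hypothetical) record, pulled through an API at different times, are diffed so that updates are retained rather than a single point-in-time diagnosis. All field names and values below are invented for illustration:

```python
# Toy snapshot diff. A real system would pull snapshots from an EHR API
# (e.g., a FHIR server) on a schedule; here the snapshots are hard-coded.

def diff_snapshots(old: dict, new: dict) -> dict:
    """Return fields that were added or changed between two record snapshots."""
    changes = {}
    for key, value in new.items():
        if old.get(key) != value:
            changes[key] = {"was": old.get(key), "now": value}
    return changes

jan = {"diagnosis": "type 2 diabetes", "hba1c": 8.1}
jun = {"diagnosis": "type 2 diabetes", "hba1c": 7.2, "medication": "metformin"}

print(diff_snapshots(jan, jun))
```

Only the changed lab value and the newly recorded medication survive the diff; the unchanged diagnosis is not re-reported.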

“It’s very, very labor-intensive to collect phenotypes in a structured way,” she says. Ideally physicians would gather detailed phenotype data, system by system with a standardized terminology—“or better yet, ontology!”—and all of those data would live in an EHR that can be mined.

One thing Rehm would definitely not recommend? Spending all of the Precision Medicine Initiative budget—President Obama proposed $215m to start—on sequencing.

“I would argue you need to spend more time focusing on the effective collection of phenotypes to then correlate that [genomic] data to,” she says, advocating for smaller pilot projects to successfully link genotypes and phenotypes before sequencing the whole cohort.

The best way to design infrastructure, she believes, is to define projects where you are seeking answers to specific questions, and use the questions and expected answers to drive the building of infrastructure.

The process will require new systems and new ways of thinking. “Our EHR environments today aren’t 100% effective at the collection of the right kind of data,” she says.

Rehm believes achieving that will require a shift in thinking about our whole healthcare system, moving from fee-for-service to value-based care. But beyond that, Rehm sees a need for looking at healthcare data in aggregate: changing the way physicians are incentivized to treat patients, and prompting them to record what steps they took that worked or didn’t work and why.

“Nobody has been tracking it, for goodness sake! I tried this drug and the patient had an adverse reaction or didn’t get better or whatever. Where is that being tracked?... We need to support a continuous learning system so that the system is gradually learning from the data it’s collecting and then that data is accessible to physicians.”

Those are the goals Rehm has for the whole healthcare system, not just a cohort for the Precision Medicine Initiative.

“The [questions] shouldn’t be just about how we are going to pay for this million-person cohort and all of this research data we want to collect. We should be asking, ‘How do we change healthcare and have genomic and phenomic data be part of our normal collection process?’”



* Fictitious name. Of course.