How to Accelerate Rare Disease Research? Write a Dictionary

By Aaron Krol

October 30, 2013 | A patient who enters a genetic clinic today on suspicion of having an unknown genetic disorder has only about a fifty percent chance of ever receiving a diagnosis. The work of connecting these vanishingly rare cases with each other, with causal mutations, and with the relevant molecular pathways is laborious and constrained by the small, disconnected data sets involved. Researchers in the field are waiting on a number of major advances to make their work easier – some of which, like fast and reliable sequencing for copy number variations or robust international medical records, will require scientific and political breakthroughs that may be years away.

Others just require a great deal of determination.

At the American Society of Human Genetics (ASHG) annual meeting last week in Boston, Dr. Ada Hamosh, Clinical Director of the Institute of Genetic Medicine at Johns Hopkins Medical School, announced the introduction of a powerful new analytical tool for the study of rare genetic diseases. The project was a year in the making, required the cooperation of numerous international organizations and a panel of expert clinical geneticists, and might colloquially be called a dictionary.

The Problem Is Communication
Over the past several years, it has been widely acknowledged that researchers of rare genetic diseases need comprehensive descriptions of patients from around the world to understand the conditions they study, as the work of finding just two related cases may require casting a net that spans continents. A number of practical tools have arisen to fill this need, including the Deciphering Developmental Disorders Project, PhenoDB, and the PhenoTips software tool, to name three that were launched in 2012 alone. In addition to these tools for building databases of individual patients and families, there are also valuable resources for researching the phenotypic descriptions and genetic associations of known rare diseases. The London Dysmorphology Database, the Human Phenotype Ontology (HPO), and the Orphanet portal all offer searchable, nesting lists of human phenotypes broken down by affected systems, associated disorders, and related genes and variants. With such an abundance of freely available information, it might seem like a simple chore for researchers to track down any and all relevant cases in the literature and draw the connections that could lead to diagnoses and better understanding of a rare genetic disease’s pathology.

The trouble is, these different systems have no way of easily communicating with one another. Each has its own unique list of phenotypic terms and order of subdividing those phenotypes, and it takes manual, intensive human curation to resolve those differences. If an old case study says a patient has “mental retardation” and a reference database says a disease presents with “intellectual disability,” only a human researcher would spot that those symptoms might be one and the same. “If you’re comparing unsolved exomes,” Dr. Hamosh told Bio-IT World, “and you’ve got 20,000 variants and all kinds of different phenotypes, you must have some high-level standardization… Computers have a really hard time unless they’re speaking exactly the same language.”

The effort to resolve these differences began in 2012, when Dr. Ségolène Ayme, the founder of Orphanet and chair of the topic advisory group on rare diseases at WHO, convened a panel of clinical geneticists and representatives from several of the major databases to agree on a common mission. The group was named the International Consortium for Human Phenotype Terminologies (ICHPT), and their responsibility was to sort the thousands of phenotypic terms appearing in the literature into a common vocabulary. “The goal is that all data submission across portals will use this terminology,” said Dr. Hamosh. With that accomplished, data from disparate sources could be easily matched by computer, and algorithms could even be written to quantify degrees of relatedness between variable presentations.
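As a rough illustration of what a shared vocabulary makes possible, the sketch below maps differently worded phenotype descriptions onto preferred terms and then scores how much two cases overlap. It is a minimal sketch, not the ICHPT's method: the synonym table, the function names, and the choice of Jaccard overlap as a relatedness measure are all assumptions made for this example.

```python
# Hypothetical synonym table: each variant wording maps to one preferred term.
# These entries are illustrative only and are not taken from the ICHPT list.
PREFERRED = {
    "mental retardation": "intellectual disability",
    "intellectual disability": "intellectual disability",
    "agenesis of the lower limb digit": "absent toe",
    "absent toe": "absent toe",
    "seizures": "seizure",
    "seizure": "seizure",
}


def normalize(terms):
    """Map free-text phenotype terms to preferred terms; keep unknown terms as-is."""
    result = set()
    for term in terms:
        key = term.strip().lower()
        result.add(PREFERRED.get(key, key))
    return result


def relatedness(a, b):
    """Jaccard overlap of two normalized phenotype sets, from 0.0 to 1.0."""
    a, b = normalize(a), normalize(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up case descriptions that use different wording for the same findings.
old_case = ["Mental retardation", "Seizures"]
new_case = ["intellectual disability", "seizure", "absent toe"]
print(relatedness(old_case, new_case))
```

Without the normalization step, the two cases above would share no terms at all. With it, two of the three distinct phenotypes match.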

After a year of outreach and advocacy, the ICHPT was able to bring on board every major organization doing research on rare genetic diseases, including the Human Variome Project, the International Rare Disease Research Consortium, and the ClinGen Project, all of which will be using the ICHPT terminology to report on cases. The grunt work, however, took place in a marathon seventeen-hour session on October 21 and 22 of this year, when Dr. Hamosh and her colleagues met to select their preferred terms, working through a list of over 2,700 phenotypes that appear at least twice in the six databases examined.

The ICHPT format is designed to be simple and relevant to all research purposes. For this reason, terms are written in plain English, in the adjective-noun form. “Agenesis of the lower limb digit,” for instance, would be written as “absent toe,” with synonyms listed for reference. Phenotypes also receive a standard definition, drawn from the Elements of Morphology where available, with the HPO and Stedman’s Medical Dictionary used as backups. Working at an average rate of just over twenty seconds per term (most were uncontroversial), the ICHPT group was able to produce a polished list of 2,302 accepted phenotypes in time for the announcement at the ASHG meeting.
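One way to picture an entry in this format is as a small record holding the preferred term, its synonyms, and a definition chosen by source priority. The sketch below assumes a layout the article does not specify: the class name, field names, and placeholder definition text are all hypothetical, and only the source ordering comes from the description above.

```python
from dataclasses import dataclass, field

# Order of preference for definitions, following the description in the article.
DEFINITION_SOURCES = [
    "Elements of Morphology",
    "HPO",
    "Stedman's Medical Dictionary",
]


@dataclass
class TermEntry:
    """Hypothetical record layout for one preferred phenotype term."""
    preferred: str
    synonyms: list = field(default_factory=list)
    definitions: dict = field(default_factory=dict)  # source name -> definition text

    def standard_definition(self):
        """Return (source, text) from the highest-priority source that has one."""
        for source in DEFINITION_SOURCES:
            text = self.definitions.get(source)
            if text:
                return source, text
        return None


entry = TermEntry(
    preferred="absent toe",
    synonyms=["agenesis of the lower limb digit"],
    # Placeholder text, not a real HPO definition.
    definitions={"HPO": "placeholder definition text"},
)
print(entry.standard_definition())
```

Because this entry has no Elements of Morphology definition, the lookup falls back to the next source in the list.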

There is still some work to be done. Neurological, ophthalmological and dermatological terms will be reviewed by relevant experts, as these fields were not represented at the October meeting. An agreed nesting order for phenotypes also lies in the future, so that, for instance, organizations can all agree which phenotypes belong under “throat” and which under “voice.” Even when this task is complete, the work of maintaining the ICHPT terms will be continuous. “The preferred term will evolve, because language evolves,” says Dr. Hamosh, noting that previously neutral terms for human phenotypes have often become slurs over time. To keep their terms up to date, the ICHPT will meet every one or two years and update their recommendations. However, the biggest hurdle to alignment between databases has been cleared. The list of terms will be released publicly in January 2014 at the site ichpt.org (not yet active), by which point Dr. Hamosh is optimistic that all the large databases will have updated their terminology to share a common language.

Efforts like the ICHPT’s may not strike everyone as being cutting-edge bioinformatics, but interoperability of databases could be more valuable than any genetic breakthrough or well-written algorithm. Harmonizing data between far-flung studies makes previously impossible comparative research trivial and frees researchers’ time for more creative analysis. In the era of big data, sometimes the largest advances are just waiting for someone to put in the hours.