Beacon Project Cracks the Door for Genomic Data Sharing
By Aaron Krol
August 14, 2015 | The Global Alliance for Genomics and Health, or GA4GH, has attracted over 350 member organizations since it was formed just over two years ago. Members range from small hospitals and laboratories to major genomic centers like the Broad Institute in the U.S., BGI in China, and the Wellcome Trust Sanger Institute in the U.K. ― and even a few household names like Google and Amazon. All these organizations have an interest in making DNA data easier to access and understand, whether because, like Google, they plan to house and analyze that data on a massive scale, or because they want to use it to treat patients one on one.
These members have signed on to GA4GH out of recognition that, twelve years after the completion of the Human Genome Project, the growth of genomic data has outstripped the containers built to hold it. Most of the computational structures that scientists use to deal with DNA data were invented ad hoc, at a time when sequencing even the genome of a single bacterium was a huge accomplishment. Through its Data Working Group, GA4GH wants to replace many existing standards, conventions, and file formats with new ones that will scale to searching through genomes at the level of whole populations ― and, crucially, make it easier for separate organizations to share data.
Some members of GA4GH are working to rewrite the fundamental way we map the human genome, replacing a widespread model that pegs DNA sequences to a “reference map” with a more flexible way of traversing the genome based on graph theory. Others want to attach standardized metadata to genomic databases, telling researchers at a glance where a specific piece of DNA data comes from, how it was collected, and what’s interesting about it. These are ambitious projects that will require scientists the world over to reexamine the vast troves of data they’ve already captured and characterized.
By contrast, the Beacon Project seems like a modest idea. All it asks is that institutions add a tiny search function on top of the databases they have, which project leader Marc Fiume boasts takes just ten minutes to implement. Yet the “beacons” that result from this minor tweak have become a flagship effort for GA4GH and, two years into the organization’s work, one of its most visible successes.
“We get requests for new beacons to join the Beacon Network almost on a weekly basis now,” Fiume tells Bio-IT World. “It’s been a good opportunity to engage really diverse people on changing the way we do genomics. It’s really bearing out the mission and the collaborative nature of GA4GH.”
Institutions that collect genomic data tend to have an instinctive urge to hoard it. There are real risks to making this kind of data public, or even to sharing it with outside researchers and clinicians. It’s possible to identify a person once you have his or her genome, whatever steps are taken to anonymize the information, and that means you’ll also have potentially sensitive knowledge about that person’s health or family at your fingertips. There are also some perverse commercial incentives at work, as in the case of commercial laboratories that don’t want to share new disease-causing genetic variants they’ve found, for fear of losing an edge in the accuracy of their tests.
Within this climate of data protectionism, the Beacon Project is a clever way to ask organizations to share even a little bit of information, as a first step toward greater openness and collaboration in genomic discovery. In setting up a beacon, institutions are creating online search functions that let anyone in the world take a peek at their databases, but only to find a particular kind of information that was carefully chosen to limit privacy and security risks.
“A beacon is a web server that answers the question, ‘Have you observed this allele or mutation?’” Fiume explains. Importantly, a beacon query doesn’t ask anything about where or how an institution has seen a particular mutation. To use a beacon, you visit the relevant web server and enter the genetic variant you’re looking for, identifying it by a specific DNA letter at a specific position in the genome. The response is a simple yes or no, telling you whether that variant exists in the database you’re searching.
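To make the idea concrete, here is a minimal sketch of the question a beacon answers, written in Python. It is an illustration only, not the GA4GH implementation: the names `Variant`, `ToyBeacon`, and `has_variant` are hypothetical, the coordinates are made up, and a real beacon sits behind a web server rather than an in-memory set.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """One observed variant (hypothetical structure for illustration)."""
    chromosome: str  # e.g. "1"
    position: int    # coordinate on that chromosome
    allele: str      # DNA letter observed at that position


class ToyBeacon:
    """Knows which variants a database contains, and reveals nothing else."""

    def __init__(self, observed: Iterable[Variant]) -> None:
        self._observed = frozenset(observed)

    def has_variant(self, chromosome: str, position: int, allele: str) -> bool:
        # The answer is a bare yes or no: no sample IDs, counts, or phenotypes.
        return Variant(chromosome, position, allele.upper()) in self._observed


# Made-up example data, not real variants.
beacon = ToyBeacon([Variant("1", 1000, "A"), Variant("2", 2500, "T")])

print(beacon.has_variant("1", 1000, "a"))  # True
print(beacon.has_variant("1", 1000, "G"))  # False
```

The point of the design is what the function leaves out: the caller learns only whether the variant is present, never who carries it.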
This may sound like a pretty abstract exercise: without any information on the individuals who carry a genetic variant, you can’t even begin to answer questions like whether it has an effect on health or biological traits, or its prevalence at a population level. But, Fiume says, the simple existence of a variant in an outside database can be a vital starting point for a deeper inquiry, especially for clinicians trying to diagnose rare genetic diseases.
“This is quite often what clinicians will do ― reach out to collaborators and ask if they’ve seen this mutation,” he says. “The Beacon Network allows someone to identify where in the world this mutation has been observed before. It’s kind of a data discovery engine, which one can follow up.” Later exchanges can help pin down the real-world significance of a variant, once two or more organizations working on the same problem have been connected.
As more institutions light beacons to signal their willingness to share data, the team behind the Beacon Project has worked to make that kind of follow-up easier. Fiume and his colleagues have devised a three-tiered system for data sharing that goes beyond the basic beacon query. While the first, public tier would not divulge any information beyond whether a variant exists, later “registered” and “controlled” tiers would provide new kinds of data to users who create accounts and receive authorization from the organizations they’re querying. What kind of data is allowed to flow is up to the various beacon administrators: while a clinical lab might want to share information on symptoms and family histories, a large research institution might be more interested in the populations where a variant appears.
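One way to picture the tiers is as a policy table that each beacon administrator fills in. The Python sketch below is an assumption about how such a policy might look, not a description of the Beacon Project's actual design: the `Tier` enum, the `FIELD_POLICY` table, the field names, and `build_response` are all hypothetical.

```python
from enum import IntEnum
from typing import Dict, Optional


class Tier(IntEnum):
    PUBLIC = 1      # anyone: existence only
    REGISTERED = 2  # users who have created an account
    CONTROLLED = 3  # users authorized by the institution


# Hypothetical policy: the minimum tier needed to see each extra field.
# Each administrator would choose their own fields and thresholds.
FIELD_POLICY: Dict[str, Tier] = {
    "populations": Tier.REGISTERED,
    "symptoms": Tier.CONTROLLED,
}


def build_response(record: Optional[Dict[str, object]], tier: Tier) -> Dict[str, object]:
    """Return only what the caller's tier is allowed to see."""
    if record is None:
        return {"exists": False}
    response: Dict[str, object] = {"exists": True}
    for field, minimum in FIELD_POLICY.items():
        if field in record and tier >= minimum:
            response[field] = record[field]
    return response


# Made-up record for a variant the database has seen.
record = {"populations": ["example population"], "symptoms": ["example symptom"]}

print(build_response(record, Tier.PUBLIC))      # {'exists': True}
print(build_response(record, Tier.REGISTERED))  # adds 'populations' only
```

A public query still gets nothing beyond the yes or no; each higher tier adds only the fields the administrator has opened up.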
As these new functions develop, basic computational frameworks set up by the Beacon Project will help prevent bottlenecks in getting data across institutional borders. Already, the GA4GH team has created a single application programming interface (API) for asking questions through beacons, which ensures that everyone can pose questions the same way and get answers in the same format. This ties in with a larger goal of GA4GH, to design a series of genomic APIs for the multitude of tasks bioinformaticians have to perform with genomic data. (DNAstack, a Toronto-based company of which Fiume is founder and CEO, is already trying to make a business out of data storage and management using these new APIs.)
“In the same way there are standard protocols for sending email, we are trying to establish shared formats for sharing genomic information,” Fiume says. “The hope is that you’ll be able to leverage a very strong query language for querying variants.”
The Network Effect
For now, the top priority of the Beacon Project is offering central access to the beacons that have been lit around the world.
At the Beacon Network, GA4GH has created a single space where queries can be run across beacons set up by 20 different organizations, covering over 250 genomic datasets. While these individual beacons are searching through many different types of data stores to find variants, the shared API makes it possible to query all of them at once, and ensures that the GA4GH team can quickly add new beacons to the network.
“One of the powers of a framework like this is that we’re federating the sharing of genomic information,” says Fiume. Federation lets scientists and clinicians take advantage of data from many different sources without having to form one-off collaborations. Fiume likens the Beacon Network to Kayak, which distributes queries about flights or hotel rooms across multiple sites that all maintain their own databases and search processes.
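The federated pattern can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Beacon Network's code: `query_network` and the beacon names are hypothetical, each beacon is modeled as a plain function rather than a remote web server, and the queries run one after another instead of in parallel.

```python
from typing import Callable, Dict, Optional

# Every beacon answers the same question in the same shape (the shared API).
BeaconQuery = Callable[[str, int, str], bool]


def query_network(
    beacons: Dict[str, BeaconQuery], chromosome: str, position: int, allele: str
) -> Dict[str, Optional[bool]]:
    """Ask every beacon the same question and collect the answers by name."""
    results: Dict[str, Optional[bool]] = {}
    for name, ask in beacons.items():
        try:
            results[name] = ask(chromosome, position, allele)
        except Exception:
            # One unreachable beacon should not sink the whole search.
            results[name] = None
    return results


def broken_beacon(chromosome: str, position: int, allele: str) -> bool:
    raise ConnectionError("beacon unavailable")


# Made-up beacons, each backed by its own made-up data store.
beacons: Dict[str, BeaconQuery] = {
    "clinic": lambda c, p, a: (c, p, a) in {("1", 1000, "A")},
    "research": lambda c, p, a: (c, p, a) in {("2", 2500, "T")},
    "offline": broken_beacon,
}

answers = query_network(beacons, "1", 1000, "A")
print(answers)  # {'clinic': True, 'research': False, 'offline': None}
```

Because each beacon keeps its own database and only the question and answer formats are shared, adding a new beacon means adding one more entry to the table.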
Having this central network creates some housekeeping responsibilities for the Beacon Project; Fiume is now considering new requirements for what kinds of datasets can be added to the network, both as a quality control measure for query results, and to prevent redundancies. Several different organizations, for instance, have lit beacons attached to the publicly available 1000 Genomes Project, which makes answers that emerge from that dataset much less informative.
Still, the very fact that redundancy has become an issue in the Beacon Project is a testament to its success at getting beacons lit around the world. Fiume attributes this partly to a network effect, in which the participation of big players like the Wellcome Trust Sanger Institute, UC Santa Cruz, and the Personal Genome Project encourages smaller organizations to light beacons of their own. The GA4GH team can now start to prioritize, reaching out, for instance, to more institutions outside the U.S. and Europe to get a greater variety of populations represented in beacon queries.
Beacons will probably never be the basis for truly sophisticated collaborations, which will often have to share sensitive information about specific people, opening up extra privacy and security concerns. For that kind of data sharing, GA4GH is sponsoring separate projects like the Matchmaker Exchange, a network dedicated to clinical labs that specialize in rare disease cases. In that project, participants are asked to put forward whole patient profiles and go through multiple layers of consent before they can make data transfers, involving patients themselves in the decision-making process.
But there will be niche cases where a beacon can be highly informative on its own. Fiume remembers fielding one request to light a beacon in a database made entirely of misleading genetic variants ― alleles that seem like they may be disease-causing, but which follow-up research has flagged as false positives. Queries to that beacon could be used as a quick way to rule out suspected variants while working on a genetic diagnosis.
That project has not panned out, because of concerns about how the database would be curated and governed. But it does hint at the kinds of creative, open projects that could be enabled by a shared language for genomic data. And as GA4GH keeps working to tackle the knottiest problems of creating a sustainable infrastructure for genomics, the Beacon Project will continue as the leading end of the wedge, opening up the social structures that will allow this crucial data to be freely and responsibly shared.