Sequencing calls for new reference best practices.
By Martin Gollery
September 28, 2010 | Commentary | Thanks to next-generation sequencing (NGS), the cost of obtaining whole mammalian genomes continues to plummet. New 3rd-generation sequencing technologies (sometimes referred to as “next-next-generation”) are in development in many locations, promising miraculously low prices and unbelievably short run times. When discussing the $1,000 genome, the question is no longer whether it will be achievable, but when. No matter how cheap sequencing a human genome becomes, there is always someone promising better results for less money in a shorter time, with delivery of these vaporware systems always starting in 12-18 months. One group (GnuBio) even promises that they will be able to deliver a human genome for $30! Suddenly, your new 5,000-core cluster does not seem big enough.
Whether or not the 3rd-generation NGS hyperbole turns out to be valid, there is no doubt that we will see an influx of tens or hundreds of thousands of human genomes over the next several years. It is imperative that those of us dealing with the onslaught implement the best practices and methods for handling these kinds of data sooner rather than later. Experience has shown that bad practices, once established, are difficult to shed. While dozens of new algorithms are being developed to align or assemble these data, what we really need are new ways to interpret and understand them.
Treating human genomes as if they were simply big sequence files is naïve. Some researchers want to be able to treat 100,000 genome sequences as they used to treat 100,000 expressed sequence tag (EST) sequences. This is a recipe for a disastrous waste of resources. Others point out that one could simply have all future genomes refer to the classic reference genome and list variations from that reference. This is vastly superior to a simplistic listing of each genome in its entirety, but is still insufficient in the long term.
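To make the reference-plus-variations idea concrete, here is a minimal sketch in Python. The `Variant` record, the `apply_variants` function, and the toy sequences are all hypothetical names and data invented for illustration; they do not correspond to any particular file format or tool.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    pos: int   # 0-based position in the reference
    ref: str   # reference bases being replaced
    alt: str   # bases observed in the sample

def apply_variants(reference: str, variants: list[Variant]) -> str:
    """Rebuild a sample sequence from the reference plus its variant list."""
    out = []
    cursor = 0
    for v in sorted(variants):
        if v.pos < cursor:
            raise ValueError(f"overlapping variant at {v.pos}")
        if reference[v.pos:v.pos + len(v.ref)] != v.ref:
            raise ValueError(f"reference mismatch at {v.pos}")
        out.append(reference[cursor:v.pos])  # unchanged stretch
        out.append(v.alt)                    # the sample's own bases
        cursor = v.pos + len(v.ref)
    out.append(reference[cursor:])
    return "".join(out)

# Toy data: one substitution and one single-base deletion.
reference = "ACGTACGTAC"
sample_variants = [Variant(2, "G", "T"), Variant(5, "CG", "C")]
sample = apply_variants(reference, sample_variants)
```

Only the two variant records need to be stored for this sample; the full sequence can be regenerated on demand from the shared reference.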
The situation with sequence data has been a mess for some time. The working bioinformatician has had to contend with a multitude of database and alignment formats, both in text and binary, for several years. No truly flexible format has been available, so nearly every new tool has also had to introduce a new data structure.
The BioHDF project, a collaboration between The HDF Group and Geospiza, will take a giant step toward solving the format compatibility problem. Currently in development, the BioHDF format and associated tools will provide a flexible solution that can accommodate data ranging from sequences to alignments to mass spectrometry, with the ability to incorporate new data types as they are developed. Compression and an efficient indexing scheme mean that the data may be extracted easily from anywhere in the file without having to unzip or search through the entire database. An API and library based on C should provide functions that allow the import and export of data to and from existing bioinformatics file formats. Wrappers are being generated for those who prefer Perl or other scripting languages.
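The general principle behind compressed storage with random access can be sketched with nothing more than Python's standard `zlib` module. This is not the BioHDF API or file layout, which the article does not describe; `build_store`, `fetch`, and the chunk size are assumptions made purely for illustration. The idea is to compress fixed-size chunks independently and keep an index of where each one starts, so a small region can be read without decompressing everything.

```python
import zlib

CHUNK = 1024  # bases per chunk; an arbitrary choice for this sketch

def build_store(sequence: bytes, chunk: int = CHUNK):
    """Compress each chunk separately and record where it starts."""
    blob = bytearray()
    index = []  # (offset into blob, compressed length) for each chunk
    for start in range(0, len(sequence), chunk):
        packed = zlib.compress(sequence[start:start + chunk])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index

def fetch(blob: bytes, index, start: int, end: int, chunk: int = CHUNK) -> bytes:
    """Return sequence[start:end], decompressing only the chunks it touches.

    Assumes 0 <= start and end <= original sequence length.
    """
    if start >= end:
        return b""
    first, last = start // chunk, (end - 1) // chunk
    parts = []
    for i in range(first, last + 1):
        offset, size = index[i]
        parts.append(zlib.decompress(blob[offset:offset + size]))
    joined = b"".join(parts)
    local = start - first * chunk
    return joined[local:local + (end - start)]

# Toy data: a highly repetitive 20,000-base sequence.
sequence = b"ACGT" * 5000
blob, index = build_store(sequence)
region = fetch(blob, index, 4090, 4110)  # spans a chunk boundary
```

Smaller chunks make random reads cheaper but compress less well, so the chunk size is a trade-off that a real format would need to tune.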
While it is a necessary and useful start to finally have a flexible and compact format, we also need a better way to capture all of the information generated by the various human genome sequencing efforts. While capturing the variants from a reference genome is great, we might also want to assign each variant a probability that indicates how common or rare it is. Then when a new sample comes in with that variant, we would have some reference to assess its significance in different contexts. The data would then look less like a FASTA file and more like a Hidden Markov Model.
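As a rough illustration of what a probabilistic reference might look like, the sketch below builds per-position base probabilities from a set of aligned sequences and scores a new sample against them. It is a simple position-specific profile, which corresponds only to the emission part of a profile HMM, with no transition states. The function names, the pseudocount, and the four-base toy cohort are all assumptions.

```python
import math
from collections import Counter

def build_profile(genomes: list[str], pseudocount: float = 1.0):
    """Per-position base probabilities from equal-length aligned sequences."""
    profile = []
    for column in zip(*genomes):
        counts = Counter(column)
        total = len(column) + 4 * pseudocount
        # The pseudocount keeps unseen bases from getting probability zero.
        profile.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return profile

def log_likelihood(profile, sample: str) -> float:
    """Sum of log-probabilities of the sample's base at each position."""
    return sum(math.log(col[base]) for col, base in zip(profile, sample))

# Toy cohort: one of four genomes carries a T at position 1.
cohort = ["ACGT", "ACGT", "ACGT", "ATGT"]
profile = build_profile(cohort)

common = log_likelihood(profile, "ACGT")   # matches the majority
known = log_likelihood(profile, "ATGT")    # carries the known variant
novel = log_likelihood(profile, "AAGT")    # carries an unseen variant
```

A variant already seen in the cohort scores higher than one never seen, which is the kind of context a flat list of differences from the reference cannot provide.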
Collections of these probabilistic models could help to further enhance the value of large-scale genome projects. After all, a particular base may be considered a ‘rare variant’ when compared to NCBI Build 36, but may be quite common among the population group of interest. The genome of a person with schizophrenia, for example, may be best compared to a database of similar genomes rather than a generic ‘one-size-fits-all’ type of reference. As more data become available, the models can be further refined to a tighter granularity. Variants may be more precisely defined so that the ‘Schizophrenic’ probabilities may be broken down into ‘Asian Female Schizophrenic’, ‘Nordic Male Schizophrenic’, and so forth.
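The point that rarity depends on the comparison group can be shown with a short sketch. The group labels, the variant identifier, and the `frequencies_by_group` function are hypothetical, and the counts are invented toy data rather than real frequencies.

```python
from collections import defaultdict

def frequencies_by_group(samples):
    """samples: iterable of (group label, set of variant identifiers).

    Returns {group: {variant: fraction of that group's samples carrying it}}.
    """
    carriers = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for group, variants in samples:
        sizes[group] += 1
        for variant in variants:
            carriers[group][variant] += 1
    return {
        group: {v: n / sizes[group] for v, n in carriers[group].items()}
        for group in sizes
    }

# Toy data: the same variant in two hypothetical collections of genomes.
VARIANT = "chr1:100A>G"  # made-up identifier
samples = [
    ("general", {VARIANT}),
    ("general", set()),
    ("general", set()),
    ("general", set()),
    ("cohort_of_interest", {VARIANT}),
    ("cohort_of_interest", {VARIANT}),
]
freq = frequencies_by_group(samples)
```

Here the variant is carried by a quarter of the general collection but by every sample in the cohort of interest, so the appropriate reference changes how it should be interpreted. Finer subgroups can be modeled by using more specific group labels as the data allow.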
Copy Number Variations (CNVs) will present a more difficult challenge. These genomic rearrangements play a major role in many types of cancer and other diseases, and in total involve more of a given genome than all of its SNPs. Traditional alignment algorithms fall far short when CNVs are present, and visual methods are difficult to compare computationally. Directed Acyclic Graphs (DAGs) have the capability to handle data of this type. A DAG algorithm generates a topology of the sequence, and this topology is what will be compared rather than a simple pairwise alignment.
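The article does not name a specific DAG algorithm, so the following is only a toy sketch of the idea, using Python's standard `graphlib` module (Python 3.9 or later). Each genome is assumed to be already divided into labeled segments, and each repeated occurrence of a segment becomes its own node, so a duplication appears as extra nodes and edges rather than as a misalignment. The segment labels and function names are hypothetical. This naive encoding suits copy-number changes that preserve segment order; rearrangements that reorder segments can introduce cycles, which `TopologicalSorter` reports as an error.

```python
from collections import Counter
from graphlib import TopologicalSorter

def path_edges(segments):
    """Turn an ordered list of segment labels into nodes and directed edges.

    Each node is (label, occurrence number), so repeats get distinct nodes.
    """
    seen = Counter()
    nodes = []
    for label in segments:
        nodes.append((label, seen[label]))
        seen[label] += 1
    return nodes, set(zip(nodes, nodes[1:]))

def merge(*edge_sets):
    """Combine paths into one graph, stored as node -> set of predecessors."""
    preds = {}
    for edges in edge_sets:
        for src, dst in edges:
            preds.setdefault(src, set())
            preds.setdefault(dst, set()).add(src)
    return preds

# Toy data: the sample carries a tandem duplication of segment B.
ref_segments = ["A", "B", "C"]
sample_segments = ["A", "B", "B", "C"]

_, ref_edges = path_edges(ref_segments)
_, sample_edges = path_edges(sample_segments)

graph = merge(ref_edges, sample_edges)
order = list(TopologicalSorter(graph).static_order())

# Compare topologies: edges present only in the sample mark the CNV.
extra_edges = sample_edges - ref_edges
gained = Counter(sample_segments) - Counter(ref_segments)
```

The comparison works on the graph structure itself: the sample's extra edges and the gained copy count both point to segment B, without any base-by-base alignment.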
Ultimately, the accumulation of hundreds of thousands of human genomes will be beneficial to those who seek to understand the workings of the human organism, and therefore, to all of us. The only question is, how can we best plan for the abundance of data that will soon be available and make the most efficient use of it?
Martin Gollery is founder of Tahoe Informatics. He can be reached at Marty.firstname.lastname@example.org.
This article also appeared in the September-October 2010 issue of Bio-IT World Magazine.