July 20, 2009 | Insights Outlook | In the upcoming report “Next-Generation Sequencing: Solving the Genome,” Insight Pharma Reports spoke with Steven Salzberg, director, Center for Bioinformatics and Computational Biology, University of Maryland, about managing and analyzing next-gen data and the software that’s making it happen.
Insight Pharma Reports (IPR): Next-gen sequencers are churning out tons of data. How important and feasible is archiving all of that?
Steven Salzberg: It’s important to keep the processed data, and it’s important to distinguish that data from the sequencer’s raw images. Files of imaging data from the sequencing plates, gels, or slides are gigantic. Image-processing software figures out the nucleotides from the images and generates files that are large, but not nearly as large as the images. So it’s important to save all the sequence reads so you can recall them, but not the raw images. Those images [can] generate on the order of a terabyte of data for one experiment. When you compress that terabyte of data down to DNA sequence, it’s going to be tens of gigabases.
IPR: You noted a year ago that data analysis software designed for Sanger sequencing may not be adequate for short-read resequencing. Have instrument manufacturers been responsible for improvements in this area, or has it been largely academicians or third-party software companies?
Salzberg: Actually, all three. I know the most about the ABI, Illumina, and Roche instruments. For the task of assembly, that is, reconstructing a genome from the reads, there’s an assembler the Roche people have developed called Newbler that’s pretty good at assembling sequences of the type that their machine produces.
Recently, there have been several new assemblers released for use with very short reads. Three were published in the same issue of Genome Research. We tried them all. The one we like the most is called Velvet, and that’s become pretty popular in the community. It was developed at the European Bioinformatics Institute in the UK, and it really works quite well for assembling very short reads. The only limitation, which is a major one, is that the new assemblers for very short reads don’t yet seem able to handle something as large as a mammalian-sized genome.
The problem is there’s so much data that if you’re assembling a mammalian or animal genome from short reads, you have to be very careful in the way the algorithm works to manage the memory. It’s not just CPU time, which is also an issue; it’s memory. You have to read all this data in at some point, and if you don’t have enough memory, the machine just can’t handle it and the system will crash.
IPR: How much assembly is needed for resequencing?
Salzberg: For resequencing, the first issue is mapping the reads onto the genome, not assembly. The mapping problem is different. You can use the software the vendors provide, but you can also use the open-source software that is being developed by a number of groups. We’ve developed a program called Bowtie which, without blowing my own horn, has quickly become one of the leading programs for mapping reads to a genome.
The Bowtie program adapted an algorithm and data structure that was previously developed in computer science, something called the Burrows-Wheeler transform, which was more or less unknown in bioinformatics. This adaptation produced a very efficient and very fast program for mapping reads to a genome. Bowtie can map reads to a human genome on a standard desktop PC at a rate of about 25 million reads per hour. To put that in perspective, a five-day run of an Illumina machine gives you 40 to 50 million reads. It will then take another two hours to do the mapping on a single desktop PC. Bowtie is multithreaded, so if you have a dual-core machine, it will do it nearly twice as fast.
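For readers curious how a Burrows-Wheeler index can locate a read without scanning the whole genome, here is a minimal Python sketch. It is not Bowtie's code: the function names (`bwt_with_suffix_array`, `build_fm_index`, `find_exact`) and the toy genome are made up for illustration, the index is built by naive sorting and stored uncompressed, and only exact matches are found. It does show the core idea, which is that the read is matched one base at a time from its last base to its first, narrowing a range of rows in the sorted index at each step.

```python
from collections import Counter


def bwt_with_suffix_array(text):
    """Return (BWT string, suffix array) for text + '$'.

    Naive construction by sorting suffixes; fine for a toy example,
    far too slow and memory-hungry for a real genome.
    """
    text = text + "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    return bwt, sa


def build_fm_index(bwt):
    """Build the two lookup tables used by backward search.

    first[ch]  = number of characters in the text smaller than ch
    occ[ch][i] = number of occurrences of ch in bwt[:i]
    """
    counts = Counter(bwt)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find_exact(read, first, occ, sa):
    """Return sorted 0-based positions where read occurs exactly."""
    lo, hi = 0, len(sa)  # half-open range of rows in the sorted suffix list
    for ch in reversed(read):  # extend the match one base to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


# Toy usage with a hypothetical 12-base "genome".
genome = "ACGTACGTTACG"
bwt, sa = bwt_with_suffix_array(genome)
first, occ = build_fm_index(bwt)
print(find_exact("ACG", first, occ, sa))  # [0, 4, 9]
```

The work per read in this sketch depends on the read's length rather than on the genome's length, which helps explain why this style of index suits very large numbers of short reads.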
IPR: Is there much use in sequencing centers for the kind of software that some people are putting out that kind of wraps up things in a big package with a neat user interface?
Salzberg: There’s definitely use for that. There are companies that are producing such packages, and these are probably going to be quite valuable. I’m not an end-user, so I’m not an expert on that, but I know from a few collaborators who are looking at SNPs that they want an interface that lets them see the SNPs. They don’t want to just do the mapping and get a gigantic text file as output. They want to be able to see some sort of graphical display, to see the SNPs with all the reads lined up on top of one another, to see what gene they’re hitting, what region of the chromosome they’re in, and what genes are nearby.
This article also appeared in the July-August 2009 issue of Bio-IT World Magazine.