David Dooling joined The Genome Center at Washington University in St. Louis in 2001 from Exxon Mobil, where he’d been developing chemical reaction models. He started as a programmer with no life science background, writing a lot of software and picking things up as he went along. He now oversees about half of the informatics group, including Laboratory Information Management Systems (LIMS); the Analysis Developers group, which creates an automated pipeline for the bioinformaticians; and the IT group—infrastructure, network computing, and storage. Kevin Davies spoke to Dooling the same week his group published the second cancer genome paper, in the New England Journal of Medicine, an important study that identified recurrent mutations in genes not previously associated with acute myeloid leukemia (AML).
Bio-IT World: How has life changed at The Genome Center in the time you’ve been there?
DOOLING: Well, things were good for a while; we had 10 terabytes of disk and everything was great! Now we have about 3 petabytes. When I started it was [ABI] 3700s. Then we replaced that fleet with 3730s, which produced megabytes a day. Then 454 came along, then Illumina, then SOLiD. It’s been gangbusters ever since.
What’s the current platform setup at The Genome Center?
Right now, we have 454s and Illuminas. We don’t have any SOLiDs any more… We’d purchased one, and [were using] a couple of others. We carried both platforms forward, but there’s a significant expense with each of them: manual labor costs, library preparation, emulsion PCR, DNA input requirements, and so on. In cancer research, you just don’t have 3-5 micrograms of DNA. The Illumina has much lower DNA requirements, which we’ve driven even lower. Between the informatics, the lab pipelines, and the analysis pipelines, wrestling with two platforms at a time is troublesome, so we made the decision to concentrate on Illumina.
Wouldn’t SOLiD’s two-base color-space encoding be advantageous for cancer genomics?
That’s true; with the color-space correction, the reads are more accurate. But I think the accuracy gain is marginal. The coverage you need to be confident you’re sampling both alleles is high enough that the marginally higher error rate you see with Illumina is washed out in the consensus.
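The allele-sampling argument can be illustrated with a back-of-the-envelope calculation (my sketch, not from the interview): if reads at a heterozygous site draw from each allele independently with probability 0.5, the chance of missing one allele entirely shrinks geometrically with depth, so at the 20-30X depths discussed here both alleles are almost always sampled and per-read errors average out in the consensus.

```python
# Idealized model: probability of observing both alleles at a
# heterozygous site at a given read depth, assuming each read samples
# either allele with probability 0.5 (ignores mapping and error biases).

def p_both_alleles(depth: int) -> float:
    """P(at least one read from each allele) at the given read depth."""
    if depth < 2:
        return 0.0
    # P(miss allele A) = P(miss allele B) = 0.5 ** depth
    return 1.0 - 2.0 * 0.5 ** depth

for d in (5, 10, 20, 30):
    print(f"{d}X coverage: P(both alleles seen) = {p_both_alleles(d):.6f}")
```

By 20X the miss probability is already on the order of one in a million, which is why a slightly higher per-read error rate matters little once consensus is taken.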
You just published a second cancer genome. Where does that fit in with your other projects?
We aim for about 300 genomes in the next eight months. It’s the Washington University Cancer Genome Initiative—150 tumor-normal pairs: 300 genomes, 150 patients. About 1/3 will be AML, 1/3 lung cancer, and 1/3 breast cancer, with a few others probably. That’s completely separate from the 1000 Genomes Project. We’ll also be doing some glioblastomas and ovarian cancers as part of The Cancer Genome Atlas (TCGA). In addition to tumor-normal pairs, we have a breast cancer quartet where we have the tumor, the normal, and a biopsy from a brain metastasis to see the difference between the primary tumor and the metastasis.
What level of fold coverage do you aim for?
We have a gross fold-coverage target of about 30X. But to really determine breadth of coverage, we SNP genotype the samples and track, lane by lane on the Illumina, how many heterozygous SNPs show both alleles. Once we get above 95-99% of those SNPs, we say we have sufficient breadth of coverage. In a perfect world, that’s around 23-24X; in practice it’s typically closer to 30X, and can be more than that. For our second AML genome, a very well-behaved genome, 23-24X did the job.
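The breadth-of-coverage check described above reduces to a simple metric. Here is a minimal sketch of that idea using hypothetical data structures (this is not TGC's actual LIMS code): for each SNP known to be heterozygous from array genotyping, ask whether the aligned reads so far show both alleles, then report the fraction of sites that do.

```python
# Sketch of the heterozygous-SNP breadth-of-coverage metric.
# het_snps and observed_alleles are hypothetical stand-ins for data
# that would come from array genotyping and the alignment pipeline.

def breadth_of_coverage(het_snps, observed_alleles):
    """Fraction of known heterozygous SNPs at which both alleles are seen.

    het_snps: {position: (allele_a, allele_b)} from SNP genotyping
    observed_alleles: {position: set of bases seen in aligned reads}
    """
    hit = sum(
        1 for pos, (a, b) in het_snps.items()
        if {a, b} <= observed_alleles.get(pos, set())
    )
    return hit / len(het_snps)

het = {101: ("A", "G"), 202: ("C", "T"), 303: ("G", "T")}
seen = {101: {"A", "G"}, 202: {"C"}, 303: {"G", "T", "A"}}
print(breadth_of_coverage(het, seen))  # 2 of 3 sites show both alleles
```

In production this would be tracked cumulatively, lane by lane, and sequencing of the sample stops once the fraction clears the 95-99% threshold mentioned above.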
Some of the sequencer runs are not usable. How do you judge the quality of alignments?
It’s different for each platform. With Illumina, you’re randomly placing fragments on the flow cell, and sometimes they get too close to each other and you can’t distinguish the signal. You might hear the term “chastity filter,” because they’re not chaste, they mix inappropriately. Those reads aren’t counted at all. Then there are the reads that don’t align. Those we keep: when we find more complex events such as translocations, we’ll try to find reads that map across them, so those are usable reads. For SOLiD, you have “illegal transitions”: reads that differ from the color-space reference by only one color. You need two adjacent color changes for a true SNP, so reads with only one are filtered out, because either that read does not align there or something was wrong with your color detection at that position. In addition, reads that don’t align aren’t very useful on SOLiD: only reads that align allow the color-space correction that boosts accuracy.
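The illegal-transition rule can be sketched in a few lines. This is a simplified illustration of the filtering logic described above, not the vendor's actual pipeline: in two-base encoding each color covers two adjacent bases, so a true isolated SNP must change two adjacent colors, while a single isolated color mismatch is more likely a detection error.

```python
# Simplified color-space mismatch classification. Colors are compared
# position by position against the color-space reference; the pattern
# of mismatches determines how the read is treated.

def classify_mismatches(read_colors: str, ref_colors: str) -> str:
    """Return 'match', 'valid_snp', 'illegal', or 'other' for a read."""
    diffs = [i for i, (r, f) in enumerate(zip(read_colors, ref_colors)) if r != f]
    if not diffs:
        return "match"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        return "valid_snp"   # two adjacent color changes: consistent with a SNP
    if len(diffs) == 1:
        return "illegal"     # isolated color change: likely an error, filtered
    return "other"           # more complex pattern; needs further analysis

print(classify_mismatches("0213", "0123"))  # valid_snp
print(classify_mismatches("0122", "0123"))  # illegal
```

Real aligners also weigh quality values and handle adjacent SNPs and indels, but the core filter is this adjacency test.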
How much do you collaborate with the other genome centers?
It’s fairly regular. I’ve visited Sanger, Baylor, Broad. Much of the collaboration is on a project-based, results level—let’s share our alignment files, our sequence data. It’s a healthy collaboration and competition. We all like to develop our own tools, but if someone else has a tool, we’re happy to use that.
Can you describe the new data center?
We took possession in May 2008. We’re now completing a second phase of construction. The building is over 16,000 square feet; the data center is about 3600 sq. ft. About one fourth was outfitted with cooling and power… Less than a year later, we’re getting the rest of that equipment installed so we can fully utilize the data center. At full capacity, it’ll consume about 4 megawatts of power. It’ll have capacity for around 100-110 racks of high-density computing equipment, averaging 15 kilowatts per rack, which is high. Current fully loaded blades are around there.
Are you working with any specific vendors?
For storage, we’re using a software solution called PolyServe, developed by a company that was purchased by HP. We like it, compared to something like Isilon (see, “Isilon’s Data Storage Odyssey,” Bio-IT World, May 2009), because it’s hardware agnostic. We can buy whatever servers, SAN switches, and disks we want. If we decide to move away from it, we can still use those disks. It’s a proprietary file system, so we’d have to move all the data off, but we’d have to do that anyway. It’s a parallel file system on the back end that any number of heads can address. It has fail-over capability… We’ve had pretty good success with it.
On the hardware side, we’ve been purchasing HP storage, which has been the cheapest. We’re using HP and Dell servers. Blades, pretty much all Dell. It’s not like we throw stuff away! Over time, we go with whatever works best.
Did you consider commercial LIMS?
I manage about a dozen people in the LIMS group. The LIMS has been developed over a decade... We have evaluated [commercial systems] on several occasions, but not recently. Actually, we’ve talked to the folks at WikiLIMS and Geospiza, but they’re not really designed for our scale. We’re topping tens of millions of transactions per month. We have tables in our database with billions and billions of rows.
You’re an open source advocate. How does that relate to your role at TGC?
Why open source? It’s just better software. Our entire system runs on Linux, Perl, PHP, Apache. We use Oracle but also MySQL and PostgreSQL. We have several thousand cores in our computer cluster and 250 desktop workstations that all run GNU/Linux, maintained by 1.5 system administrators. You’re talking about thousands of systems that can be maintained by 1.5 FTEs. You can’t get that with a Windows solution or a Mac solution. Granted these guys are highly skilled, but if there’s a problem, they can dig into it. At the scale we operate, we’re always breaking things. Whatever people bring in here, it breaks. We need to have the capability to tweak and to have the source code there and the communities that develop around free software. When we have problems, Google is our friend. 99 times out of 100, you’ll find someone who had that problem. With the proprietary solutions, there’s not a lot out there. They may not care about you.
Do you work with any commercial software tools?
We’ve spoken with CLC bio, and we were one of the first to partner with Synamatix on their search tool. We’ve worked with Novocraft. There’s also Real Time Genomics, formerly SLIM Search (see, “The Quest to Make Sequence Sense,” Bio-IT World, Nov 2006). We’ve had that for a couple of years and are talking to them about their next-generation alignment and analysis tools. We look at them, but it’s a tough nut to crack for those folks given the pace at which this field is changing.
Are you seeing much progress in alignment tools?
You can easily make it less of a bottleneck now. We use more than one aligner. We’re very comfortable with MAQ and have been using it for a long time. It’s not as computationally efficient as others, so we’re currently running several others in parallel with MAQ all the way through our pipeline… We’ll take alignments from each tool and run them through the pipeline in parallel. We’re aggressively testing lots of things to see what is optimal… We’re focusing on BWA [from Richard Durbin’s group]. It uses the Burrows-Wheeler transform, as do Bowtie and SOAP2.
Do you have any need for cloud computing?
Yes and no. We’re interested in making our tools more useful to as many people as possible, releasing them open source. Part of that is making them useful in HPC environments, whether clouds, or Open Science Grid or BOINC (the engine behind SETI@Home). The one we’re most aggressively pursuing is Open Science Grid (OSG), a federation of grids that provide end-users computing resources through a granting process. It’s not like Amazon, where they charge by the hour...
The other side is that the utilization of our infrastructure goes through ebbs and flows. It’ll be much more efficient to have a system that could overflow onto OSG in times of stress, rather than have things pile up or build a much larger infrastructure just to support the heaviest utilization periods. We’re also talking to Sanger. In March, we had a Genome Informatics Alliance meeting. Amazon was there, Google, OSG, Microsoft. One of the action items was to work with those folks. Sanger took the lead with Amazon.
How do you deposit data into NCBI’s short read archive (SRA)?
We use Aspera. To the best of my knowledge, that’s the only option NCBI provides; other transfer methods exist, but NCBI does not support them.
Illumina has increased throughput over the past year. Are you experiencing that?
Sure. There are two aspects to the increased throughput: longer read lengths and higher cluster densities. The cluster-density gains come from software improvements: it now does a better job of disambiguating overlapping clusters. The read-length gains come largely from better chemistry, for example better deblocking of the reversible terminator. Each component of the reaction is not 100% efficient, so you get phasing. Drive the reaction closer to 100% and you get less phasing and higher signal to noise. Our standard operating procedure is 2x75 bp. You can run them to 2x100, but the error rate is such that it’s not as attractive for us right now.
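The phasing effect described above is geometric, which a quick calculation makes concrete (the efficiency numbers below are hypothetical, not Illumina's actual figures): if each cycle's chemistry completes on a fraction p of the strands in a cluster, only p to the power n strands are still in phase after n cycles, so signal-to-noise decays with read length and small efficiency gains buy disproportionately longer usable reads.

```python
# Geometric decay of in-phase signal with read length, for a
# per-cycle chemistry efficiency. Efficiencies shown are illustrative.

def in_phase_fraction(efficiency: float, cycles: int) -> float:
    """Fraction of strands in a cluster still in phase after n cycles."""
    return efficiency ** cycles

for eff in (0.990, 0.995, 0.999):
    f75 = in_phase_fraction(eff, 75)
    f100 = in_phase_fraction(eff, 100)
    print(f"efficiency {eff}: 75 cycles -> {f75:.3f}, 100 cycles -> {f100:.3f}")
```

This is why pushing per-cycle efficiency "closer to 100%" matters so much, and why the error rate at 2x100 was still the limiting factor at the time of this interview.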
What are the bottlenecks you anticipate in the next 12 months?
I’d be lucky if I could pick the bottlenecks for the next 8 hours! Essentially, to get to where we are right now, we’ve created a very well-balanced system. There isn’t one aspect of the pipeline I’m concerned about—I’m concerned about them all in equal measure. Initially, you’re getting the data, so you buy a lot of disk space. Then you buy more compute nodes, but you can’t get the data to the compute nodes, so you upgrade your network. Now you’re not efficiently using your CPUs, so you rewrite the algorithm in C and make the computation more efficient. Then you find the disk I/O is bad, so you need a more distributed system for higher throughput. We’re getting sustained 10-15 gigabits per second out of our disk system now. It’s crazy! A year ago, you couldn’t do that. So each time you dial one up, you have to dial the others up. Now they’re all at 11. It’s just a matter of keeping that stuff in balance and enhancing your monitoring and troubleshooting techniques.
For the current generation of sequencing technologies, we’re on a good path. Everything scales really nicely. For PacBio and the like, it’s going to be 1-2 years before a production instrument really gains a foothold. I’m very interested in working with any of these third-generation sequencers at very early stages to figure out what the problems are. They’re going to have to deliver data in a very different way. You’re never going to have the equivalent of images—it’s just not possible at that scale. It’s likely that’s going to be much more information than you need, but you won’t know what you need. What sort of systems will be in place?
By the end of this year, you’ll have dozens of whole-genome sequences. Where are the tools to do whole-genome vs. whole-genome comparison? Linking that up with phenotypic information? That’s the other huge challenge.
Could you ever outsource sequencing to someone like Complete Genomics?
Sure, why not? By the time they hit that $5000 mark, other vendors will be hitting that mark. SOLiD said $30,000 for their genome. We’re looking somewhere around what they’re charging now per genome in the not too distant future ($20K range). That’s a fully loaded cost—including instrument depreciation.