Software engineer Bruce Martin’s chief claim to fame came early in his career. Hired straight out of school in the late ‘80s, he was a member of James Gosling’s team at Sun that developed Java. He moved on to other things as Java scaled up—mobile and email communications, banking and compliance—but nothing remotely close to life sciences.
Martin’s latest challenge is a doozy: building the IT and informatics infrastructure at Complete Genomics (see “Will the Gene Microscope Change the World,” Bio-IT World, May 2009) to support a human genome sequencing service capable of delivering 1 million genomes in five years. If he’s having sleepless nights, he doesn’t show it. “It’s about crafting the right team with the right mix of skills and knowledge, and trusting them.” Recruited by CEO Cliff Reid, Martin didn’t hesitate. “For about a decade, I’d wanted to find something with a stronger footprint in the sciences, but also where I could contribute. So it seemed like a perfect marriage.” Martin brought expertise in software development and high-scale computing, and built a team of experts in bioinformatics, genomics, assembly, and large-scale scientific computing.
Bio-IT World Editor-in-Chief Kevin Davies visited Martin at Complete Genomics (CGI) headquarters in California.
Bio-IT WORLD: Bruce, where is your data center?
MARTIN: For a variety of strategic reasons, we don’t want to build out a data center here. A data center is capitalized over 25-50 years. Some pieces of it, like generators, have a 40-50 year depreciation path. That’s just not typically something a small company wants to finance. The alternative is an outsourced data center. At this point, we have over 1000 compute cores and 1 petabyte (PB) of data storage for R&D and pilot projects, with plans to scale up significantly for production. It is all located offsite at a co-location facility we lease in Santa Clara.
Think of the data center as a large warehouse with security, disaster recovery, back-up power, multiple power grids, etc. You go through these airlocks and security stations, biometrics, the whole nine yards. It’s the size of a football field lined with chain-link cages, trays of fiber optics and lots of computers. We have a chunk of the room, with our own security process to get into the CGI cage…
Once you use a megawatt of power (enough to supply a small town of 1000 people), you have access to a market where you get your own room! You get to decide a lot more about how you want it wired and organized. This is typically called the wholesale market. These facilities are used by large-scale compute consumers like a Yahoo or eBay, or companies who divide up the space and resell it on a retail market… We have a 10-Gigabit WAN connection to our data center, which is relatively fast (1000 times a cable connection). We push all of our data offsite to this compute environment.
What happens as you move from R&D into commercial sequence production?
For productization, our focus is scale-out and that ends up being a fairly complicated transition. As our sequencers get faster, we deploy more of them, our efficiencies get better, and we need more and more compute systems. We have preferred technologies and a software architecture that can distribute the workload into a large cluster of compute and storage. This enables us to scale as our business scales—we rack it out as we need it. From a facility standpoint, we’re going to deploy into a wholesale space, where we have more flexibility and lower costs. We can pre-allocate a lot of space, build out the infrastructure as we need it, and have a much, much, much faster network connection to the data center.
One of the nice things in an area like Silicon Valley is that there’s a lot of dark fiber in the ground—bundles of fiber optics that are not in use. If you’re willing to take on the cost and expertise of running the network gear, you can cut your cost dramatically by using this infrastructure. This is what people do when they need to move a lot of data. We’ll have very reliable, redundantly connected links at hundreds of gigabits for a fraction of the cost of an equivalent Internet connection. As we grow, we will be able to scale into hundreds of gigabits per second [Gbps] in 2010. That’s a function of how much data the instruments generate. Our cost, scale and reliability analysis indicates that we are better off putting most of our compute offsite. These instruments throw off a lot of data!
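Martin’s point about link speed is easy to see with a back-of-envelope calculation. The sketch below computes how long it takes to push a day’s worth of instrument output offsite at different line rates; the 10 TB/day volume and the 70% achieved-throughput figure are illustrative assumptions, not numbers from Complete Genomics.

```python
# Back-of-envelope: hours to move a day's instrument output offsite
# at various link speeds. Data volume and link efficiency are
# invented for illustration.

def transfer_hours(data_terabytes: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours to move `data_terabytes` over a `link_gbps` link,
    assuming `efficiency` fraction of line rate is achieved."""
    bits = data_terabytes * 1e12 * 8               # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

for gbps in (1, 10, 100):
    print(f"{gbps:>4} Gbps: {transfer_hours(10, gbps):6.1f} h for 10 TB")
```

At 1 Gbps a 10 TB day takes over a full day to transfer, so the pipeline would never keep up; at 10 Gbps it takes a few hours, and at 100 Gbps it is well under an hour, which is why dark fiber matters at this scale.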
Have you already selected specific storage and network partners?
In some cases yes, in others we’re still in the evaluation phase. One of the things you try to do when you design an analysis pipeline is maintain the ability to switch vendors and evolve the technology platform to take advantage of new offerings. There will be technology evolution and multiple generations of hardware. The goal is to optimize around cost, efficiency and the operational aspects such as quality control and product features rather than being tied to a single platform or vendor. You’d like to roll out an RFP every couple of quarters for any major technology purchase, and have the ability to pick the current best of breed.
Most of our effort has been around determining the features, functions and architecture of our ideal software platform, rather than the final technology vendors. In some cases we have our initial preferred vendors, and we know from whom we want to buy our first few tranches of equipment. We’re getting close on storage, and close on networking, with computing and other elements to follow.
With modern-day data centers there are interesting cost trade-offs. Once the functional requirements are in place (for example, the right instruction set or CPU architecture), there are trade-offs among manageability, because personnel and operations are a big cost contributor; power consumption, which could be a third of your cost; and capital. You put those three together, along with your feature/function trade-offs, and that’s how you choose your vendors.
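The three-way trade-off Martin describes can be sketched as a toy cost-of-ownership model. All dollar figures, depreciation periods and power draws below are made-up placeholders; the point is only that a cheaper, power-hungrier vendor can lose to a pricier, more efficient one once power and operations are counted.

```python
# A minimal sketch of the capital / power / operations trade-off.
# Every number here is an invented placeholder for illustration.

def annual_cost(capex: float, years: int, kw: float,
                dollars_per_kwh: float, ops_per_year: float) -> float:
    """Rough annual cost of ownership for a block of equipment."""
    capital = capex / years                     # straight-line amortization
    power = kw * 24 * 365 * dollars_per_kwh     # runs 24x7
    return capital + power + ops_per_year

# Vendor A: cheaper hardware, hungrier power draw
a = annual_cost(capex=300_000, years=4, kw=30,
                dollars_per_kwh=0.15, ops_per_year=40_000)
# Vendor B: pricier hardware, half the power draw
b = annual_cost(capex=360_000, years=4, kw=15,
                dollars_per_kwh=0.15, ops_per_year=40_000)
print(f"A: ${a:,.0f}/yr  B: ${b:,.0f}/yr")
```

With these invented numbers Vendor B’s extra capital is more than paid back by the smaller power bill, which matches Martin’s remark that power can be a third of the cost.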
What particular vendors have you selected?
We selected Isilon for our R&D storage platform. We currently have close to 1 PB of Isilon and we run all of our sequencing operations on the platform. They’re under consideration for our production platform as well, but we haven’t made our final selection. We’re still having the bake off.
In our selection process there are multiple steps. It starts with an initial phase where your IT architecture team runs around with a butterfly net trying to capture information about technology and vendors with a very open mind… what could solve our problem? Rapidly you reduce possible solutions to a handful of vendors where you do a deep evaluation and get into the specifics of the platform. What particular architecture would work? What are the power and space and other operational characteristics? How well does it fit into your reliability model and your scalability model? We’re pretty far down the road with three vendors on storage, and it’s unlikely it wouldn’t be one of those three.
How much do you rely on past experiences and relationships with preferred vendors, or do you try to keep an open mind?
There are experiences and relationships, and you want to leverage those, but I am cognizant of three things. First, I am solving a problem new to me, so I have to be very open-minded about the design and requirements of the problem. Second, frankly, the world has changed since the last time I did this, and it will change before I do it again. You must listen to the vendors, hear what their communities are saying, and find the best-of-breed solution for the problem. We don’t want to be right on the bleeding edge, but the problem we are solving is very high throughput and very high scale. It is an HPC (high performance computing) problem that is very different from what you see in the Web 2.0 or enterprise shops. The good news is that there are HPC communities in the National Labs and Supercomputing Centers leading the way, so we can draft behind those organizations in terms of their experience, architecture and technology choices. Third, it’s not like I do this on my own! It’s all about finding people with relevant expertise, ensuring they understand the business goals, helping them work together and then getting out of their way. I have team members with life sciences experience to help define the application, and HPC and computer science experts for large-scale storage and compute experience. I have recruited from some of the best Web 2.0 and enterprise companies in the Valley, including experts on indexing and searching. I was able to build an aggregate of the skill sets I needed. It’s the only way… hire very smart people, put them together and let the problem cook for a bit of time.
Do you have green considerations if your data center is stuck in a warehouse somewhere?
Absolutely. In our building or in an outside data center, efficiency is important because the cost of power is a large component in the cost of computing. In a data center environment, you pay for what’s known as the burden or overhead, which is the power wasted to support power distribution, cooling and anything other than the actual computation. So if I go to a data center that is more efficient, I will have some benefit from that increased efficiency. The power bill is a combination of the power to run and cool your computers, get power to the computers and all other power overhead. There is a natural drive to become more efficient, and therefore greener.
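The “burden or overhead” Martin describes is commonly expressed as PUE (Power Usage Effectiveness): total facility power divided by the power that actually reaches the IT equipment. The sketch below uses illustrative figures, not Complete Genomics numbers, to show how facility efficiency flows straight into the power bill.

```python
# PUE = total facility power / IT power. All kW and price figures
# below are invented for illustration.

def pue(it_kw: float, cooling_kw: float, distribution_loss_kw: float) -> float:
    """Power Usage Effectiveness: total draw over useful IT draw."""
    return (it_kw + cooling_kw + distribution_loss_kw) / it_kw

def annual_power_bill(it_kw: float, pue_value: float,
                      dollars_per_kwh: float) -> float:
    """Yearly cost of IT load plus its facility overhead, run 24x7."""
    return it_kw * pue_value * 24 * 365 * dollars_per_kwh

legacy = pue(1000, 700, 100)       # an older, less efficient facility
efficient = pue(1000, 300, 50)     # a more efficient site
saving = (annual_power_bill(1000, legacy, 0.10)
          - annual_power_bill(1000, efficient, 0.10))
print(f"PUE {legacy:.2f} vs {efficient:.2f}: "
      f"${saving:,.0f}/yr saved at 1 MW IT load")
```

At a megawatt of IT load, the scale Martin mentions for the wholesale market, even a modest PUE improvement is worth hundreds of thousands of dollars a year, which is the “natural drive to become more efficient” he describes.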
How do you ship the data off the machines to do the assembly?
Our pipeline is a subject of active R&D. A picture of what we’re doing now: Life starts as an image—we go through a few steps of image processing, where we pull out intensity data from the images. We extract all of the different channels, light frequencies, and then we go through a base-calling process.
At this point, we make a probabilistic evaluation of what the most likely base is. We score it with a vector of information that will also tell us other less likely but possible calls. (Those initial steps are all done in real time—the scale and speed are so high we don’t write that to disc. We stream it through a cluster of computers in a redundant and reliable way). That information gets poured into a pipeline that does things with a more traditional HPC cluster model. We save those calls, with the score information and the probability vectors to a large disk farm.
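The base-calling step Martin describes produces not just a single call but a probability vector over all four bases plus a quality score. The toy sketch below shows only that data shape; real base callers model channel crosstalk and platform-specific error modes, and the softmax over raw intensities is an invented stand-in.

```python
# Toy base caller: turn four channel intensities into a most-likely
# base plus a probability vector. The intensity model is invented.

import math

def call_base(intensities: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Softmax the channel intensities into call probabilities."""
    exps = {b: math.exp(v) for b, v in intensities.items()}
    total = sum(exps.values())
    probs = {b: v / total for b, v in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs

def phred(p_error: float) -> float:
    """Phred-style quality score for an error probability."""
    return -10 * math.log10(p_error)

base, probs = call_base({"A": 5.0, "C": 1.0, "G": 0.8, "T": 0.5})
print(base, {b: round(p, 3) for b, p in probs.items()},
      round(phred(1 - probs[base]), 1))
```

Keeping the whole vector, rather than just the winning base, is what lets the downstream assembly weigh “less likely but possible calls” when it evaluates sequence hypotheses.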
Then we go through a set of fairly traditional steps to filter and map the data, though we have some unique algorithms due to our read structure, and novel software enhancements. Our filtering process is relatively standard—we’re looking at things like reads with insufficient quality to be useful downstream, systemic error modes, things we know about our production processes, etc. This filtering helps downstream analysis.
We then enter into the assembly pipeline, which is optimized for human resequencing. Because our business model is focused on human resequencing, we take advantage of that knowledge to optimize and reduce computing costs. We align the reads to a reference genome, similar to other high-speed short-read aligners (mappers). Mapping is used solely to pool and organize the reads by likely positions on the genome where they’re probably going to be contributing information. We then assemble these partitioned reads using a combination of a local de novo process to generate sequence hypotheses, a Bayesian model to evaluate the quality of the hypotheses, and an optimization loop to find improvements in the calls.
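The hypothesis-evaluation idea can be illustrated with a drastically simplified likelihood model: given candidate local sequences and the reads pooled into that region, score each hypothesis by how well the reads fit it under a per-base error rate. The sequences, error rate and brute-force alignment below are all invented for illustration and are far simpler than a production Bayesian model.

```python
# Toy hypothesis scoring: which candidate local sequence best
# explains a pool of reads, under a uniform per-base error model?
# Sequences and error rate are invented.

import math

def log_likelihood(hypothesis: str, reads: list[str],
                   p_err: float = 0.01) -> float:
    """Sum over reads of the best-offset alignment log-likelihood."""
    total = 0.0
    for read in reads:
        best = -math.inf
        # brute-force every alignment offset of the read
        for off in range(len(hypothesis) - len(read) + 1):
            ll = sum(
                math.log(1 - p_err) if hypothesis[off + i] == b
                else math.log(p_err / 3)
                for i, b in enumerate(read)
            )
            best = max(best, ll)
        total += best
    return total

reads = ["ACGT", "CGTA", "GTAC"]
hypotheses = ["ACGTAC", "ACGAAC"]      # e.g. reference-like vs variant-like
scores = {h: log_likelihood(h, reads) for h in hypotheses}
print(max(scores, key=scores.get))     # best-fit hypothesis
```

The optimization loop Martin mentions would then propose edits to the current best hypothesis and keep any that raise this score.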
We’re trying to find the best-fit hypothesis for a given set of reads. Once we get a set of very fit hypotheses in every region of the genome, we then proceed to variation calling. We’ll align the potential variations to the reference and make a call on the variation. The output of that is a genotype file.
What’s interesting about the assembly approach is that we’re computationally very efficient. We take advantage of the reference genome to organize reads, but we’re not bound to it for the variations we call. We’re able to detect fairly large variations, much larger than the 1-2-base indels typically found by current mapping algorithms. Increasing our ability to detect larger and rare variants is a major R&D project for Complete Genomics. Ultimately we expect that our sequencing and software technologies will be able to find very long and complex variations.
Why are you better at detecting short rather than large variations?
There are two contributors to that. One is inherent to short-read data. There are classes of variation—for example, medium-sized repeats—which short read platforms have a hard time resolving due to the information content of the read. In addition, our software is a work in progress, and our historical focus was on the most frequent and most studied variations—SNPs and short indels. We are progressively focusing on larger and more complex variations and our success with these will improve over time. Some are relatively easy to detect with mate-pair analysis… some are more challenging. High quality short variation calling is our current state of the art, but we have prototype software to detect other classes of variation, though it’s not yet mature enough to put into production.
What is the minimum number of reads for you to be confident you have a sequence?
As is typical for analysis algorithms, there is a trade-off between read length, error rate, coverage depth, bias and other factors. In our case, we typically require between 15X and 20X coverage per haplotype to get a very high quality genome. Sequence characteristics also can affect required coverage, and in many cases we do well with far less coverage.
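Why depths like 15-20X are needed is easy to see under the idealized Poisson model of coverage (the Lander-Waterman approximation): even at a healthy mean depth, some positions land in the thin tail. This sketch deliberately ignores coverage bias and mappability, which are exactly the “other factors” Martin mentions and which push real requirements higher.

```python
# Poisson coverage sketch: at mean depth c, what fraction of
# positions are covered by at least k reads?

import math

def p_depth_at_least(k: int, mean_coverage: float) -> float:
    """P(depth >= k) at a position, depth ~ Poisson(mean_coverage)."""
    p_less = sum(
        math.exp(-mean_coverage) * mean_coverage**i / math.factorial(i)
        for i in range(k)
    )
    return 1 - p_less

for c in (10, 15, 20):
    frac = p_depth_at_least(3, c)   # positions with >= 3 supporting reads
    print(f"{c}X: {frac:.5%} of positions have at least 3 reads")
```

Under this idealization nearly every base is well covered at 20X, so the residual coverage gaps in real genomes come mostly from bias and repeat structure rather than sampling alone.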
What happens after assembly?
Post-assembly, we have a genotype file. We do a few additional steps to make it easier to use. We’re working on a variety of annotation steps, e.g. annotating variations with dbSNP accession IDs, functional annotations, and the like. Reports and annotation of the calls will make the data easier to use.
After annotation, there is a validation stage. Like most sequencing systems, we do QC at every step. We want to catch failures early. But there’s a certain class of analysis you really can’t do until you get the complete data set. Until you have all variations and aggregated the metrics from every stage, there’s a certain class of issues you’re really not going to detect. We do a computationally intensive final QC pass—the validation stage of the pipeline results. The goal is to look at all the metrics captured, the final result, anything else we know. The automated validation will allow us to increase our sequencing capacity to hundreds of genomes per day and maintain high quality. It is a lot cheaper and more efficient to throw computers at QC than to have people manually perform the task, and our scale allows us to make the software investments required to carry out this task in a fully automated manner.
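The automated final QC Martin describes amounts to checking aggregated whole-genome metrics against release thresholds. The sketch below shows that pattern; the metric names and threshold values are invented for illustration, not Complete Genomics’ actual checks.

```python
# Sketch of automated final QC: compare aggregated per-genome
# metrics against release thresholds. Names and values are invented.

THRESHOLDS = {
    "mean_coverage":         (15.0, None),    # (min, max); None = unbounded
    "pct_genome_called":     (95.0, None),
    "duplicate_rate":        (None, 0.15),
    "reference_discordance": (None, 0.002),
}

def validate(metrics: dict[str, float]) -> list[str]:
    """Return a list of failed checks; an empty list means release-ready."""
    failures = []
    for name, (lo, hi) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif lo is not None and value < lo:
            failures.append(f"{name}: {value} below minimum {lo}")
        elif hi is not None and value > hi:
            failures.append(f"{name}: {value} above maximum {hi}")
    return failures

genome = {"mean_coverage": 18.2, "pct_genome_called": 96.1,
          "duplicate_rate": 0.08, "reference_discordance": 0.0011}
print(validate(genome) or "PASS")
```

Because checks like these run unattended on every genome, capacity can grow to hundreds of genomes per day without a matching growth in manual review staff, which is the economic point Martin makes.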
What’s the format on the hard drives?
Our results contain both read and called variations files. We haven’t made a final selection on file formats and we are in discussions with customers, collaborators, the NCBI, and others to try to determine the best format. What should we support as our default? We may end up supporting more than one. The current format is a CGI-designed file format, but this is likely to change as we get more feedback. The short read data is encoded in a compressed binary file format designed to be small on the disk and easy to compute on. We designed it for use in our native pipeline.
The variations file format is similarly one of our own design; and it has been used by some of our collaborators with good success. It’s fairly straightforward: at a given position, we found this variation. Both file formats are designed to be easy to compute upon, because that’s what we do every day at CGI.
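The CGI variations format itself isn’t published here, so the sketch below invents a hypothetical tab-separated file of the “at a given position, we found this variation” shape Martin describes, just to show why such a format is easy to compute on. The column names and records are entirely made up.

```python
# Hypothetical variations file of the shape described in the
# interview: position-oriented, one variation per row. The format
# and all records below are invented for illustration.

import csv
import io

SAMPLE = (
    "chromosome\tposition\treference\tcalled\ttype\tscore\n"
    "chr1\t883516\tG\tA\tsnp\t48\n"
    "chr1\t949523\tC\tCT\tinsertion\t33\n"
    "chr2\t167262\tAT\tA\tdeletion\t41\n"
)

def read_variations(text: str) -> list[dict[str, str]]:
    """Parse the hypothetical file into a list of variation records."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

variants = read_variations(SAMPLE)
snps = [v for v in variants if v["type"] == "snp"]
print(len(variants), "variants,", len(snps), "SNP(s)")
```

A flat, position-keyed layout like this is trivial to filter, join against annotation sources such as dbSNP, and stream at scale, which is presumably what “easy to compute upon” means in practice.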
There’s an open question in the community about the right formats. There are as many choices as vendors and institutes. We certainly would like to see one standard emerge; I think it would be better for everybody, but the science and technology is a moving target. We may be able to contribute to solving that problem in some way.
You’re using the NCBI reference genome?
Yes, we are using Build 36 in our resequencing pipeline. We use it primarily as a computing aid and as a reference to report variants against. We’ll switch to Build 37 as it is more commonly adopted in the community. We’re careful not to have our algorithms introduce reference bias into the process, but the use of a reference is a significant benefit to computing efficiency and performance.
We’re entering an interesting period now, with a very significant amount of genomic data being generated, and we’re clearly moving to a period where the single haploid reference genome is insufficient to capture what is known. You have to ask, what is the reference? Who’s to say that one genome versus another represents the most likely base in any given position? Is a simple linear sequence the most effective way to represent the reference genome? I think we’ll rapidly enter the realm where we need to capture population statistics and have a richer, more nuanced recognition of the complexity of the sequenced population.
What are your major challenges as you scale up to 1,000 genomes in 2009 and 20,000 in 2010?
There are many challenges. Some we can solve with money or time, as they are relatively tractable. These include operational service build-out and finalizing our product. There’s a big difference between a technology and a finished product, and we are in the last few steps of product development including final hardware designs and software production. In addition, we have an extremely aggressive and interesting technology roadmap ahead of us—we will continue to focus on the challenge of ongoing R&D and technology development, pushing our sequencing platform forward in cost, performance and quality. Overall I’m confident that this will all come together because we have such a great team.