Clive Brown, vice president of development and informatics for Oxford Nanopore Technologies (ONT), a.k.a. “the most honest guy in all of next-gen sequencing,” as dubbed by The Genome Center's David Dooling, is hoping to catch lightning in a bottle again. Five years ago, he was the head of software for Solexa, spearheading the British company’s entry into the second-generation sequencing market, which spurred a $650-million acquisition by Illumina and early domination of the next-gen sequencing market. After a spell at the Wellcome Trust Sanger Institute, Brown and his fellow Solexa alum, vice president of research John Milton, have shuttled over from Cambridge to Oxford to commercialize the astonishing potential of nanopore sequencing. Oxford Nanopore has not yet revealed details of its future platform, but in early 2009 it published a lovely paper in Nature Nanotechnology showing that its alpha-hemolysin nanopores can discriminate between the four bases of DNA (not to mention a fifth, methyl C). With a tidy $18-million marketing deal from Illumina, ONT is working on multiple fronts: chemistry, engineering, and of course, IT and informatics. Kevin Davies spoke to Brown about the company’s progress and prospects.
Bio-IT World: Before we get into Oxford Nanopore, what was your reaction to Stephen Quake’s single-molecule genome paper?
CLIVE BROWN: It’s a little bizarre. The main positive message is that they’ve done a single-molecule human genome. This is perfectly worthy of a good Nature Biotechnology paper. The numbers they’re citing give a respectable throughput of about 2 Gigabases/day. The error rates are higher but they’re going to get that with single molecule fluorescence.
But then there’s this Table 1 (Supplementary Information) with old cost claims. Their own cost seems to be exactly what Illumina is citing for their service sequencing now [$48,000]. And it’s bizarre. They’re using the number of names on the Solexa paper [Nature 2008] as evidence of how many people are required to run an instrument! Well, that Solexa paper was the culmination of 8 years’ work encompassing the entire development of the platform, so it had everybody’s name on it. The CEO’s name was on there, and he didn’t run any instruments.
They’re setting themselves up for a tagline: ‘Look, we only need three people to run a Helicos machine. And Illumina needs all these people and it’s much more expensive.’ If they’d just stuck to the high ground, i.e. they’ve got a working system that does single-molecule genomes, they’d be a lot better off in the credibility stakes. But, this apparent back-door marketing stuff is ridiculous. For example, their original paper had 20 or 30-odd authors for a 6-kb viral genome.
I think Helicos deserves some kudos. They’ve stuck with it, they’ve had a rough time as a company, and they’ve made it work about as good as it can work with single-molecule fluorescence, with the cameras they have. People have taken the technology outside and they’ve used it successfully. And that’s not trivial. If I was them, I’d have stuck to that message. They should stick to the high ground—you can quote me on that.
What progress have you made since your Nature Nanotechnology paper early this year?
It’s difficult to give you specifics without revealing too much, but things are on track—without telling you what the track is! We’ve taken care of a lot of the key risks in our technology. Not all of them, but a lot of them. We’re building things. On the informatics side, we’ve actually built the entire informatics subsystem of the instrument.
One of the things you have to accomplish at a company like ONT or Solexa is to become a small genome center at some point. Before launching a product, you have to run it in house for months, doing genome-center type things. In some ways it’s harder because the system is unpolished; it therefore requires more effort to manage and run, and thus certain kinds of software infrastructure are required. Roger Pettett [ex-Wellcome Trust Sanger Institute] is building that infrastructure. He’s also working on software that goes on the instrument itself, which is revolutionary new stuff, but we’re reluctant to talk about that at the moment. It does break the conventional instrument software paradigm.
Does the real-time nature of the nanopore platform make a big difference to the informatics?
What we’re doing in terms of data processing is in some ways easier than Solexa, in some ways harder. Our data rate is much higher—it is many tens of Megabytes/second. If you look at the high-end real-time computing world, it’s in the middle of the data rate range. If you look at high end military radar applications, sonar, etc. we’re two-thirds down that scale. But we’re certainly significantly higher than traditional lab equipment. We decided to tackle this data processing problem very early.
So, we’ve designed and built the basic data processing subsystems very early. Even before we were running chemistry, we made software that simulated data streams at launch spec rates. We designed interconnects and wiring, computer boards and live software that would process that data. We did it all in parallel. So when it came to plugging a chip in, it all more or less worked.
The issue of data reduction close to the instrument is one we tackled very early. But one has to be very careful. When you make one of these products, typically the chemistry evolves all the way to launch, and even after launch. So the way you build things has to be agile and flexible enough to keep up with and optimize against those chemistry changes. So you can’t really build a specification-driven rigid system—it has to be high performance but very, very flexible to change. That’s a real challenge; building a high performance system consisting of hardware and software that can evolve rapidly against real data.
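To make the idea of software that "simulated data streams at launch spec rates" concrete, here is a minimal Python sketch. It is not ONT's simulator: the function names, the 16-bit sample format, and the noisy-baseline signal model are all assumptions chosen for illustration. The generator emits exactly the number of bytes that a given rate and duration imply, so downstream processing code can be exercised before any real chemistry exists.

```python
import random
import struct
from typing import Iterator, Tuple


def simulated_stream(rate_bytes_per_s: int, chunk_bytes: int,
                     duration_s: float, seed: int = 0) -> Iterator[bytes]:
    """Yield synthetic sample chunks totalling rate * duration bytes.

    Hypothetical signal model: 16-bit little-endian samples drawn from
    a noisy baseline. Pacing is left to the consumer so the generator
    stays deterministic and easy to test.
    """
    rng = random.Random(seed)
    total = int(rate_bytes_per_s * duration_s)
    samples_per_chunk = chunk_bytes // 2
    sent = 0
    while sent < total:
        n = min(samples_per_chunk, (total - sent) // 2)
        if n == 0:
            break
        # Mask to 16 bits so every value fits the unsigned "H" format.
        values = [int(rng.gauss(1000, 50)) & 0xFFFF for _ in range(n)]
        chunk = struct.pack(f"<{n}H", *values)
        sent += len(chunk)
        yield chunk


def drain(stream: Iterator[bytes]) -> Tuple[int, int]:
    """Consume a stream and return (chunk count, total bytes)."""
    chunks = 0
    total = 0
    for chunk in stream:
        chunks += 1
        total += len(chunk)
    return chunks, total


if __name__ == "__main__":
    # Half a second of data at an assumed 1 MB/s, in 64 KiB chunks.
    print(drain(simulated_stream(1_000_000, 65_536, 0.5)))
```

A real-time harness would sleep between chunks to hold the target rate. Leaving that to the caller keeps the same generator usable for both paced tests and fast offline runs.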
Are you trying to move the data processing close to the chip, and if so, why?
When you scale up, it gets harder to move data over a wire. People are complaining about copying data over a network with the current sequencers. Now imagine that 100X worse. It isn’t feasible. So you have to move more and more of the processing close to the point of data generation. If you look in other parts of the electronics world, they have the same issues, integrating more and more stuff closer to the point of generation. It’s the same with microprocessors, e.g. Intel CPUs. They put more and more on the silicon. So we have already put a huge effort into not outputting raw data, but outputting optimized processed data instead. That’s not to say you can’t output raw data. My philosophy has always been to have an open system that lets people dig around in the raw data and algorithms and understand it all, but you would need to flick a switch to get the raw data out.
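A rough calculation shows why raw output does not scale. At an assumed 50 MB/s (one value consistent with "many tens of Megabytes/second"), an instrument would produce 50 × 86,400 ≈ 4.3 TB of raw data per day. The Python sketch below illustrates one generic way to shrink a raw trace near its source: collapsing runs of roughly constant signal into (start, length, mean) events. This is a hypothetical technique chosen for illustration, not the processing ONT describes, and the `jump` threshold is an invented parameter.

```python
from typing import List, Tuple

Event = Tuple[int, int, float]  # (start index, length, mean level)


def reduce_to_events(samples: List[float], jump: float) -> List[Event]:
    """Collapse a raw trace into runs of roughly constant level.

    A new event starts whenever a sample differs from the running mean
    of the current event by more than `jump`.
    """
    events: List[Event] = []
    start = 0
    total = 0.0
    for i, x in enumerate(samples):
        n = i - start  # samples already in the current event
        if n > 0 and abs(x - total / n) > jump:
            events.append((start, n, total / n))
            start, total = i, 0.0
        total += x
    if samples:
        n = len(samples) - start
        events.append((start, n, total / n))
    return events


if __name__ == "__main__":
    trace = [10, 10, 11, 50, 51, 49, 10, 10]
    for event in reduce_to_events(trace, jump=5):
        print(event)
```

The reduction ratio depends entirely on how long the signal dwells at each level, so a sketch like this says nothing about the compression a real instrument achieves.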
Do you plan to use accelerators such as GPUs or FPGAs?
Two things are going on there. There’s what goes on inside the instrument, and then you have these interim experimental data sets to deal with during the product development phase. So, classically, what you do is write data to disc, and then you have an analysis pipeline that runs on your cluster to look at your data centrally. That’s where the Solexa software came from—the Solexa GA pipeline was actually that, but ended up being shipped for various reasons. We have that pipeline equivalent in house, and have been looking at implementing some of its algorithms on GPUs. For our internal computing needs, there are some attractions to this. But you can’t yet economically stick a GPU inside an instrument, because the power consumption gets quite high.
The other option is to use FPGAs. They’re good accelerators, very low power requirements, but a bugger to program and so not very agile. Once you’ve got your algorithms finished, specifiable, at that point, you can stick it on an FPGA. So FPGA is something we might use at the end of the development process, but not during it.
Bear in mind that standard CPUs are getting faster and faster—they’re not too shabby at all! So far we haven’t had any problems in terms of compute speed when dealing with our data, either at the instrument level or centralized datasets. We haven’t had any problems with that at all, but we are constantly trying to drive up the efficiency of data processing.
Have you started simulating the data processing in real time?
Yes, and it’s quite early, and we’re not scared! But I’ll tell you what’s hard—the hardest bit is the wiring! Wiring two bits of circuit board together… The processing side isn’t so bad, it’s all the interconnects, moving data from A to B. We’re building and designing most of the instrument internally. We use some off the shelf components, and have partnered for others, but we do a lot of our own PCB design.
You’ve said you’re having to “beat the software vendors off with a stick.” How do you see yourselves working with them eventually?
Where I see them coming in is in dealing with what comes out of the ports on the back of our sequencer, which hopefully should be a quality scored DNA sequence, possibly with some accessory information for QC. Obviously, we’re targeting high-end large-scale laboratories, but we’re also targeting the bench-top. In order to have a bench-top sequencer that does the kind of applications being pioneered in the genome centers, to have that accessible to any researcher, we have to provide pretty easy to use bioinformatics solutions. Otherwise, it’s just not going to happen, is it? One of the problems with all these existing sequencers is, even if you automate the sample prep and make the sequencer easy to use, you still end up with a file with a billion short reads in it. This is still beyond the capability of most non-bioinformatically trained postdocs to do anything sensible with. Our system will have even more of that, and we are looking at longer reads. Nevertheless, a massive amount of complex data.
We have to provide, next to our sequencer on the bench—I hate to use the word ‘turnkey’—but a pretty polished bioinformatics solution to deal with that. I’ve been talking to a number of interesting companies—there’s a current favorite—we’ve been doing some proof of principle work. They’ve been demonstrating that their workflow software can deal with a whole human genome-type workflow in a day or 6 hours on a typical workstation. For example, it reads in data from a massively fragmented human genome, then executes a standard QC and alignment workflow, and then produces graphical reports and SNP calls. They’ve done it in reasonable time, in my view, both for development effort and speed of execution. I can’t say who it is, yet, but that looks quite promising to me.
Are you writing your own alignment and variant-calling software?
Nava Whiteford is working on a paper at the moment on very long-read sequencing and its informatics. He’s adapting existing algorithms already out there. For contrast, at Solexa, Tony Cox had to come up with a completely new one called ELAND to deal with mountains of short reads. However, we think we can adapt methods that exist already in such a way as to work optimally with our data. We’ve also developed some interesting file formats for our data—called Fast5 for scored sequences. We’ve also co-developed some HDF-5 based raw data container formats and those are all public domain. We’re slowly putting in the groundwork for adoption.
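The interview does not describe the layout of Fast5 or of the HDF-5 containers. As a general illustration of what a "scored sequence" usually means, the Python sketch below assumes Phred-scaled qualities, Q = -10 log10(p), written in the common FASTQ text convention with an ASCII offset of 33. Both the scaling and the record layout are assumptions here, not details of ONT's formats.

```python
import math
from typing import Sequence

MAX_Q = 93  # highest score representable with offset 33 in printable ASCII


def phred(p_error: float) -> int:
    """Convert a per-base error probability to a rounded Phred score."""
    if p_error <= 0:
        return MAX_Q
    return min(MAX_Q, max(0, round(-10 * math.log10(p_error))))


def encode_quality(error_probs: Sequence[float], offset: int = 33) -> str:
    """Encode error probabilities as a FASTQ-style quality string."""
    return "".join(chr(phred(p) + offset) for p in error_probs)


def fastq_record(name: str, seq: str, error_probs: Sequence[float]) -> str:
    """Build a four-line FASTQ-style record for one read."""
    if len(seq) != len(error_probs):
        raise ValueError("need one error probability per base")
    return f"@{name}\n{seq}\n+\n{encode_quality(error_probs)}"


if __name__ == "__main__":
    print(fastq_record("read1", "ACG", [0.001, 0.01, 0.1]))
```

A text record like this is easy for existing aligners to consume, which fits the stated aim of adapting existing algorithms. A binary container such as HDF5 is better suited to carrying raw signal alongside the called bases.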
You say you’re not providing IT solutions, but won’t some early customers want that?
A lot of early customers, because of the current ‘next-gen’ sequencers, will have made the IT investment anyway. A lot of that can be recycled for our sequencer. Our sequencer won’t have the IT overheads of the current sequencers; they’ll be greatly reduced, even though the output will be higher. And unless we get merged again, and all my plans get derailed, I’m going to partner with somebody in terms of the early adoption process. These people will liaise with IT groups and help us to smooth the early adoption of lots of our systems. It won’t be as big a barrier as it was at Solexa, but we will have to put some effort into it. I’m more worried about the bench-top side than the high-end side. Once we go to full commercial launch, we are partnered with Illumina and there will have to be agreed and synergistic approaches.
So you’re targeting two very different markets—genome centers on the one hand, the bench top on the other?
Traditionally, there are about 15,000 of the old capillary/gel-based sequencers. They enjoy about 4% growth per year, mostly in forensics and other conservative market segments. There are about 900 next-gen sequencers out there now. They’ve traditionally gone to people who already do some sequencing, e.g. core facilities, genome centers etc. However, what will the future look like? Well, an obvious direction is the bench-top sequencer, which sits next to a postdoc or other lab researcher. We target that. We also target genome centers, in that they can run lots of our machines in a scaled up operation, rather like the [Illumina] Genome Analyzers are currently run.
The thing that isn’t really there yet, which oddly enough is being broached by Complete Genomics, is this idea of very large sequencing centers that are not traditional genome centers. They’re either commercial service operations or they’re run by health providers, and they’re doing medical sequencing or whatever. Those don’t exist yet, but I can’t imagine why they wouldn’t in the future. We go across the board from the very small to the very large. What’s new here is the very small and the very large. The middle bit is the old market, still important but a subset of our ambitions.
There are people out there who want to do these new sequencing applications perhaps at a smaller scale or more occasionally. These applications have been pioneered by people who already bought Solexa or SOLiD, but they’re not quite accessible yet. At the other end of the scale I think Complete Genomics is a perfect customer for us. In fact I think our machine’s better suited for what they want to do than theirs is!
It sounds like everything is on track. Where are the problems?
I don’t want to oversell things, and remember we are still very stealthy as a company. On the other side of the coin, any of the more honest people will tell you things just aren’t linear in a company like this. You have days when things work beautifully, and long dry periods when things aren’t working. Half of it is just keeping your nerve. You’ve just got to plough through it all.
Surely your Solexa experience must be helpful in this regard?
Yeah, I think so. [Former Solexa CEO] John West would always say, “You’ll be amazed what you can get to work if you put your mind to it”. The average age of employees here is just 26-27. So we’re trying to impart that attitude to people who haven’t been through this process before. It’s quite a leadership challenge, actually.
What are some of your group doing specifically?
Let me just say, I think they’re all fantastic. A few examples: Stuart Reid, a physicist by background, is doing a fantastic job. He’s very cross-disciplinary but he works in my group which also deals with product design, specification and systems integration. So a lot of what we’re doing is about data and data quality measures and targets. Stu is driving that, along with some of the fundamental science that feeds into the platform. Roger Pettett, who joined us from Sanger, is doing a critical job actually. He’s a very understated chap, but we have to build a ‘genome center’ here and Roger is driving the informatics infrastructure to accomplish that. Lukasz Szajkowski, who joined us from Illumina, is managing the writing of the instrument software. To my mind, one of the most risky areas, but it’s all on track thanks to him.
Nava Whiteford, who also came from Sanger, wrote the first proper publication on short-read sequencing feasibility. He’s doing a fantastic job across the board on bioinformatics and IT. I’ve got Gavin Harper, a statistician from GSK, who churns through mountains of raw data, measuring things, and keeps us all honest. We have Mick Knaggs, a very experienced molecular modeler, who has provided key insights into the nanopore design and optimization. He has implemented a lot of his software on GPU-enabled systems in house. There are many others; they’re all actually very, very good. I’ve been very lucky to get them.
Hmm… The Sanger Institute must love you…
Yeah, we did have a chat about my recruitment methods.
A news story based on this interview appeared in the September-October 2009 issue of Bio-IT World Magazine.