By Kevin Davies
September 15, 2009 | Earlier this year, Pacific Biosciences founder and CTO Stephen Turner ran an animation of a real-time single-molecule sequence trace as a crawl at the foot of his slides for the duration of his talk, demonstrating not merely the impressive length of DNA reads the company could generate, but also its slightly hypnotic quality. “I do hope that some of you will watch the rest of the talk,” Turner said.
The cute animation was devised by a member of Scott Helgesen’s software group at PacBio, which is not surprising. A decade ago, Helgesen and Brad Carvey composed the opening dragonfly CGI sequence for Men in Black, before Helgesen traded New Mexico for New England and a job with 454 Life Sciences.
Helgesen is now part of the software team, headed by VP Kevin Corcoran, that PacBio is depending on to handle the data from PacBio’s single-molecule sequencer next year. The third-generation sequencing system eavesdrops on a grid of DNA polymerase enzymes, tethered to the bottom of nanoscopic wells, as they synthesize DNA strands in real time. As each fluorescently tagged base is snared by the polymerase prior to being incorporated in the new DNA strand, its signal is detected (see “PacBio Sparks Florida Fireworks,” Bio•IT World, Mar 2008). The method is dubbed single molecule, real time (SMRT).
The vital job of capturing that information and producing the informatics pipeline that converts those signals into pure sequence falls to Corcoran, who, together with Helgesen, has ties with almost every competitor in the market. Corcoran formerly ran the sequencing business for Applied Biosystems (AB), and was involved in AB’s due diligence of its own next-gen sequencing acquisition, Agencourt Personal Genomics. He was previously the CEO of Lynx, which merged with Solexa in 2004, two years before Illumina acquired the new entity. Helgesen, director of software engineering, primary analysis and simulation, spearheaded software development at 454 for several years, until leaving in 2006, just as the first Genome Sequencer was released.
Real Time Analysis
PacBio has a fairly large instrumentation software group that writes hard-core firmware and builds real-time operating systems. Once acquired, the data are passed on to Helgesen’s group, which handles the primary analysis: image processing, signal processing, base calling and quality value assignments. From there, the sequence data are subjected to secondary analysis, including consensus calls and assembly.
“The biggest challenge is real-time processing of the data,” says Helgesen. “The data rates—the amount of data that comes flooding from the sensors—are really high compared to 454, much higher because we’re looking at real-time events.” Unlike at 454, where Helgesen used a CCD camera to integrate photons over time and under his control, “Now, we’re not in charge of the events happening—the molecules just do their thing and we have to watch them.”
One of Helgesen’s passions at 454 was the use of FPGA (field programmable gate array) technology. He’s circumspect on whether he sees a niche for the hybrid computing solution, but Corcoran says: “We’ll either go down the FPGA route or some of these other alternatives. Graphics GPUs are becoming very affordable and more easily programmable.”
For the prototype research instrument, which measures 3,000 DNA polymerase enzymes running in parallel, the software team has to capture the data in real time but doesn’t need to process in real time. When the commercial instrument launches in 2010, however, “the spec for the production system for shipping is capture in real time and process in real time, to keep the throughput going,” Helgesen says.
Helgesen says the processing throughput is a trade-off between the length of each DNA fragment and the number of fragments being sequenced in parallel. The scheme is scalable, which is neat. “The big issue is the data are coming in so fast, you don’t have time to store it to disk. You cannot capture the original raw signal data, so you have to process that—first-level data reduction in real time. Even when I interviewed here and heard the number, I was like, ‘Man…!’”
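To make the idea of first-level data reduction concrete, here is a minimal sketch, not PacBio’s actual pipeline. It assumes a single fluorescence channel delivered as a stream of numeric samples and a fixed threshold; the `Pulse` record, the `detect_pulses` function and the threshold value are all hypothetical names chosen for illustration. The point is that each sample is examined once as it arrives, and only a compact summary of each pulse is kept, so the raw trace never has to be stored.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Pulse:
    start: int    # index of the first sample above the threshold
    length: int   # number of consecutive samples above the threshold
    mean: float   # mean amplitude over the pulse


def detect_pulses(samples: Iterable[float], threshold: float) -> Iterator[Pulse]:
    """Yield one Pulse per run of samples above `threshold`.

    Samples are consumed one at a time and never buffered, so memory use
    stays constant no matter how long the stream is.
    """
    start = None
    total = 0.0
    count = 0
    for index, value in enumerate(samples):
        if value > threshold:
            if start is None:      # a new pulse begins here
                start = index
                total = 0.0
                count = 0
            total += value
            count += 1
        elif start is not None:    # the pulse just ended
            yield Pulse(start, count, total / count)
            start = None
    if start is not None:          # the stream ended in the middle of a pulse
        yield Pulse(start, count, total / count)


# Illustrative trace: two pulses separated by baseline samples.
trace = [0.0, 0.0, 5.0, 6.0, 0.0, 0.0, 7.0, 0.0]
pulses = list(detect_pulses(trace, threshold=1.0))
for pulse in pulses:
    print(pulse)
```

A real instrument would face noise, drifting baselines and several dye channels at once, which a fixed threshold does not handle; the sketch only shows why reducing the signal to pulse summaries on the fly removes the need to write raw data to disk.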
Handling that problem requires a concerted IT strategy, involving computers with internal blades, data reduction strategies, algorithm optimization, and more.
Reads and Errors
Late last year, PacBio published examples of its first single-molecule sequencing results in a paper in Science. The single-read errors were on the order of 15-20%, but those data were generated almost 12 months ago. “The interesting thing about single-molecule sequencing is that errors are random versus systematic,” says Corcoran. “In Sanger reads, the errors start to get worse the farther you go out. We don’t see that phenomenon.”
Another benefit of PacBio’s approach is molecular consensus. By circularizing the DNA template into a so-called SMRTBell, the polymerase could figuratively “take a couple of laps around the circular molecule, [so] you get phenomenal consensus accuracy of one particular molecule.” Turner reported individual reads of several thousand bases earlier this year.
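Why random errors matter for consensus can be illustrated with a little arithmetic. The sketch below is an assumption-laden simplification, not a description of PacBio’s method: it treats each pass around the circle as an independent read of a base with the same error probability, ignores insertions and deletions, and counts a consensus call as wrong only when more than half of the passes are wrong. The function name `majority_error` is hypothetical. Under those assumptions the consensus error is a binomial tail.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n independent passes are wrong,
    given a per-pass error probability p (binomial tail)."""
    need = n // 2 + 1  # smallest number of wrong passes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Per-pass error of 15%, the lower end of the single-read figure quoted above.
for n in (1, 3, 5, 7):
    print(n, round(majority_error(0.15, n), 4))
```

With odd numbers of passes the figure falls quickly as laps are added, which is the intuition behind the consensus claim. Systematic errors would not behave this way, because every pass would tend to make the same mistake.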
Corcoran says a priority is to drive up the raw accuracy rates as high as possible. The consensus sequencing mode would by definition reduce throughput, but provide additional fidelity when searching for rare mutations. “In Scott’s pipeline, he has huge amounts of raw movies,” says Corcoran. “He has to identify where all the pulses are, assign a base to that pulse in real time, and then, if it’s a molecular consensus run, assign a consensus value to that particular read.”
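The last step Corcoran describes, assigning a consensus value to a read, can be sketched as a per-position majority vote. This is a toy illustration under strong assumptions, not PacBio’s algorithm: the passes are taken to be already aligned and of equal length, with substitution errors only, and the function name `consensus` and the agreement score are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def consensus(passes: Sequence[str]) -> Tuple[str, List[float]]:
    """Majority-vote consensus over aligned, equal-length passes.

    Returns the consensus string and, for each position, the fraction of
    passes that agree with the chosen base (a crude confidence value).
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")

    bases = []
    agreement = []
    for column in zip(*passes):  # one column per template position
        base, votes = Counter(column).most_common(1)[0]
        bases.append(base)
        agreement.append(votes / len(passes))
    return "".join(bases), agreement


# Three illustrative passes over the same molecule, each with one random error
# at most; the errors fall at different positions.
laps = ["ACGTAC", "ACGTTC", "ACCTAC"]
sequence, agreement = consensus(laps)
print(sequence)
print(agreement)
```

Real single-molecule reads also contain insertions and deletions, so a production pipeline would have to align the passes before voting; the sketch skips that step entirely.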
As for the sequence traces themselves, Corcoran says customers will have the option of saving them, “but they’ll have to be saved off onto some system they provide. We’ll stream them off the instrument in real time as we’re processing.” More likely, they will go down a level and save the base calls and associated confidence values.
I asked Helgesen how PacBio compared with the nostalgic days at 454. “It’s definitely more challenging,” he says. At PacBio, Helgesen has a team that is “really strong on simulation, figuring out everything beforehand.” “The best thing about Scott,” adds Corcoran, “is that I was explaining what I was looking for, [and] he instantaneously knew what all my problems were!”
Kevin Corcoran was a software engineer at Applied Biosystems who became head of the Genetic Analysis software group. In 1992, AB spun out Lynx (see “Just Bead It,” Bio•IT World, Feb 2004) to develop antisense therapeutics, but the company later redirected its efforts to short-read sequencing based on technology developed by Nobel laureate Sydney Brenner. “It was the first massively parallel sequencing in a big way—we were doing 2 million events,” says Corcoran. The MPSS (massively parallel sequencing system) produced 24-base reads, used mainly for transcriptome profiling. The technology had its challenges, but “as a service, it worked very well.”
However, the technology was way ahead of its time. “You were talking to people and trying to explain the benefits of digital expression. Today, everybody gets it! The technology was ten years too early.”
In 2003, Lynx and Solexa jointly bought the assets of a Swiss company called Manteia. With Lynx running out of money and Solexa in need of engineering expertise, they entered into a transatlantic reverse merger. “It made perfect sense,” says Corcoran. “Since we both jointly owned the Manteia technology we both had guns pointed at each other’s heads. They had cash; we were a public company.” The newly public Solexa was then swallowed up by Illumina.
Corcoran opted to return to AB and run the sequencing business for a couple of years. One of his duties, along with Andy Watson, was to identify prospects for AB’s next-gen platform. “AB had a big program, looked at a wide range of technologies. We settled on Agencourt Personal Genomics. We did due diligence on a lot of technologies.”
After that, Corcoran took ten months off and “recuperated.” But with several ex-colleagues reveling at PacBio, he inevitably got the call. Still, he admits to being “very curious about our friends at Oxford Nanopore,” having gotten to know Clive Brown and John Milton during the Solexa merger.
Helgesen’s interview at PacBio was very different from his job interview at 454, where all anybody wanted to know was how he came to create the special effects for the first two minutes of Men in Black! Joining 454, his first taste of biotechnology, Helgesen had no idea if building next-gen sequencing software was possible. “Now, after going through that experience, I’m used to that situation. I’m not as nervous about it. I’m a software engineer.”
Before joining PacBio, he did talk to 454 founder Jonathan Rothberg about his latest venture, Ion Torrent Systems, but Rothberg couldn’t seal the deal. “No way I want to move back to the East Coast,” said Helgesen honestly. Now he gets to enjoy the California climate, and more importantly, as Corcoran points out, join more than “200 people who understand where you’re going. Everybody has this idea of their responsibility.”
This article also appeared in the September-October 2009 issue of Bio-IT World Magazine.