REvolution Computing brings Parallel R to life sciences.
By Kevin Davies
July 14, 2008 | With the exponential growth in data, researchers and clinicians are finding it increasingly challenging to run analyses in real time. REvolution Computing, a New Haven-based next-generation analytics software company with a background in parallel computing, is quietly helping life scientists using open source R software to do just that.
Less than three years ago, REvolution Computing began deploying parallel solutions around R software, which was developed as an open source statistical system, “to fill the gap in the wake of proprietary systems in business intelligence such as SAS and S+,” says CEO Richard Schultz.
REvolution Computing “has a long history of being able to make software run very fast,” says Schultz, whether it be in life sciences or energy exploration. The rapid growth in R’s popularity and market share enables the community of statisticians and bioinformaticians “to contribute research back into [the] project, to see modernization happen in real time.” Indeed, where SAS dominated ten years ago, Schultz says the textbooks are “all R now. That speaks to the pace at which this system has been applied.”
Open Source Model
REvolution’s open source business model around R is similar to the strategies of companies such as MySQL, JBoss, and Red Hat. REvolution is providing commercial support for a user base of one million and growing, many of whom want to know, “How do I take this research tool and scale on the clinical side?” Schultz points to the volumes of clinical data (MRIs, gene expression) “that weren’t even storable ten years ago. That’s the future. Mapping that data in real time onto a set of solutions.”
“The benefits of open source have to do with modernizing a challenging and difficult platform,” says Schultz. “In statistics, this is a critical point. The legacy players, their solutions have been around for 40 years. That kind of software had difficulty scaling to modern kinds of usages. Existing software just can’t keep up.”
Without the open source approach, it would take “hundreds of researchers to keep this up,” which Schultz terms “a very difficult proposition.” This way, “we take a worldwide community, the thought leaders, often with leading academic positions, and get them to contribute to the project and give back to the community at large.” He draws parallels with the open source goals of the Human Genome Project, “furthering society and drug discovery. That’s why we got engaged in this aspect of the project.”
REvolution Computing started to work with Pfizer about two years ago, says Schultz, and now has “a huge user base of R.” But R was not designed to take advantage of more than one computer core. As it turned out, REvolution had a scalability solution ready. “Our team had been doing research on multi-core computing,” he says. “Can you marry high performance technology with the community’s work on R? We set out to develop Parallel R, a high-performance version of R.”
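Parallel R’s own interface is not shown in the article. As a hedged illustration of the general approach, here is how an embarrassingly parallel job (a bootstrap resampling, say) could be spread across cores with the snow package, one of the cluster backends available to R users in this era; the worker count, data set, and model are stand-ins, not REvolution’s actual code.

```r
# Illustrative only: parallelizing an embarrassingly parallel bootstrap
# across four workers with the snow package (ParallelR's API differed).
library(snow)

cl <- makeCluster(4, type = "SOCK")   # four worker processes, one per core

# Each worker refits the model on a resampled data set -- no shared state,
# so the iterations can run on any core in any order.
boot_once <- function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))
}

results <- parLapply(cl, 1:1000, boot_once)
stopCluster(cl)
```

Because each iteration is independent, the speedup scales roughly with the number of cores, which is the property REvolution exploited for workloads like Pfizer’s.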
Schultz claims that Pfizer got a “150-fold speed up on their cluster in one week” (see “Pfizer Partnership”). Moreover, “work that would take six months can be done in a day.” REvolution has quietly been expanding its list of life science customers. The company has also worked with Novartis, Merck, Bristol-Myers Squibb, and other top 15 pharmas. “Beyond that, our software is widely used, sometimes in ways we don’t even know.”
Schultz sounds a little put out when I ask how pharma companies found his company. “For folks looking for high-performance computing, we have a reputation as being leaders in the ability to address very difficult problems.” In fact, REvolution worked on the Star Wars project, and has worked on financial applications and with other major IT companies such as Cray. “I hesitate to say, we’re a very big fish in a small pond!”
Once REvolution began deploying high-performance R solutions, it was soon faced with much broader R questions, such as: How do I support it on my cluster? What do I do with huge gene datasets, or clinical trial data? New challenges in cancer and diabetes are emerging, he says.
The only external investor so far is Intel. The chipmaker’s interest comes from two perspectives. On the one hand, Intel is a leading provider of multi-core hardware, but there’s a dearth of suitable software to take advantage of the increased performance. “They saw the synergy,” says Schultz. Moreover, Intel has “a very active open source software group. The R Project reached their radar. Our combination of multi-core and open source R was right in the sweet spot of what they found interesting.”
Intel’s investment as a strategic partner provides access to scale. “As our user base grows, our ability to support that is extremely important. Making sure we have the infrastructure to solve today’s problems, rather than yesterday’s, means we’ll have continued investment in R&D. The Intel backing provides further assurance we’ll be the leader,” says Schultz.
“We hope to get ahead of our customers, in the same way the hardware has gotten in front of the software,” Schultz continues. “Everybody’s got a multi-core machine but most software only runs on one of those cores. I’ve got a dual-core Apple laptop in front of me, but I run Word on just one core. Our job is to flip that equation back to the other side, and make the tools the statisticians and clinical researchers utilize that much more powerful.”
REvolution is gaining traction in other verticals, especially financial services: anywhere R is used and there are lots of data or simulations. But, says Schultz, “life sciences will always be near and dear to us.”
Pfizer Partnership
At the Bio-IT World Conference and Expo last April, REvolution Computing announced results of a benchmark study, conducted with Pfizer, on chemical classification data on quad-core AMD systems. The groups looked at caretNWS, a parallel version of caret implemented using Parallel R, for drug safety studies. (The caretNWS package is available publicly at: www.cran.r-project.org.)
Working with Pfizer, REvolution Computing parallelized the caret package. caretNWS provides parallel processing functionality that reduces the computational time to build models without sacrificing model quality. The research showed that the caretNWS software accelerated the analysis of large data sets, reducing the number of potential candidate molecules for new drugs and increasing the efficiency of drug development.
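The article does not show caretNWS’s interface. As a sketch of the kind of workload it accelerates, here is a plain, sequential caret model-building call; caretNWS’s contribution was to distribute the resampling iterations of fits like this across workers. The data set and model choice below are illustrative stand-ins for Pfizer’s chemical classification data.

```r
# Sequential caret fit of a classifier. Each cross-validation resample
# is an independent model fit, which is what caretNWS ran in parallel.
library(caret)

data(iris)   # stand-in for a chemical-descriptor data set
set.seed(42) # make the resampling reproducible

fit <- train(Species ~ ., data = iris,
             method = "rf",   # random forest; needs the randomForest package
             trControl = trainControl(method = "cv", number = 10))

print(fit$results)  # accuracy estimates across the tuning grid
```

The 10 cross-validation folds here are exactly the kind of independent iterations that spread cleanly across cores.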
“We were able to improve our ability to bring new medicines to the market quickly,” said Max Kuhn, Pfizer’s associate director of non-clinical statistics. “CaretNWS is an asset in the battle against the rising costs associated with new drug development, which is why this is available on a broad, public basis. The ability to conduct large data analysis across multi-core processors represents a significant benefit for drug discovery and development.”
CaretNWS was used to predict the safety component of compounds, specifically carcinogenic side effects in potential drugs. These models can also eliminate the expensive and time-consuming process of studying a large number of potential compounds in the physical laboratory. --K.D.
This article appeared in Bio-IT World Magazine.