By Salvatore Salamone
May 15, 2004 | Better data integration -- not bulging IT muscle -- dominated discussions at “Informatics and IT Infrastructure” conference sessions and workshops at the 2004 Bio-IT World Conference + Expo. More computing brawn, speakers and attendees said, won’t solve the problem of dealing with the flood of data and the fact that so much data is derived from multidisciplinary research.
Until now, the preferred tactic for tackling formidable informatics tasks has been to apply more computer horsepower. Marshall Peterson, CTO of the J. Craig Venter Foundation, put the data-handling crunch into perspective: “There have been great strides in high-performance computing -- the ASCI [Los Alamos National Laboratory’s] program has done miracles in scaling HPC, but it’s not the HPC [we] need to get our jobs done.”
Dawn of the Petabyte
Researchers are grappling with a new computing model, Peterson emphasized, which he labeled as petabyte computing. “The tools needed for the next steps [such as pursuing personalized medicine] will require managing massive amounts of heterogeneous data in a global environment,” he said.
Rather than just look at sequence data, the emphasis will be on examining data from many sources. “It’s the difference between just doing a [single] SQL query versus doing an SQL query, running a BLAST command-line search [with] the results, and then using some Perl code to do something with the output,” Peterson said.
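As a rough illustration of the multi-step workflow Peterson describes, the sketch below chains a SQL query, a BLAST command line, and a post-processing step. It is not taken from any speaker's code. Python stands in for the Perl step, the `sequences` table and its columns are hypothetical, and the command line assumes the `blastn` program from the BLAST+ suite with its 12-column tabular output (`-outfmt 6`). The BLAST command is only constructed here, not executed.

```python
import sqlite3


def fetch_sequences(conn, organism):
    """Step 1: a single SQL query (hypothetical table: sequences(id, organism, seq))."""
    cur = conn.execute(
        "SELECT id, seq FROM sequences WHERE organism = ? ORDER BY id",
        (organism,),
    )
    return cur.fetchall()


def to_fasta(rows):
    """Turn (id, sequence) rows into FASTA text that BLAST can read as a query."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in rows)


def blast_command(query_path, db_name):
    """Step 2: build a blastn command line (assumes BLAST+ is installed).

    The list could be passed to subprocess.run(); it is not run here.
    """
    return ["blastn", "-query", query_path, "-db", db_name, "-outfmt", "6"]


def best_hits(tabular_text):
    """Step 3: post-process tabular BLAST output, keeping the top hit per query.

    In -outfmt 6 the first column is the query id, the second is the
    subject id, and the twelfth is the bit score.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


if __name__ == "__main__":
    # In-memory database with made-up rows, purely for demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (id TEXT, organism TEXT, seq TEXT)")
    conn.execute("INSERT INTO sequences VALUES ('seq1', 'human', 'ACGTACGT')")
    print(to_fasta(fetch_sequences(conn, "human")))
    print(" ".join(blast_command("query.fa", "my_db")))
```

Even in this toy form, each hand-off between steps (rows to FASTA, FASTA to a file, tabular text to a dictionary) is glue code that someone has to write and maintain, which is the integration burden the speakers were pointing to.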
Other speakers echoed Peterson’s sentiments. “There’s been a change in view in the industry,” said Jeff Augen, president and CEO of TurboWorx. “You don’t just get a few blades, populate a cluster, and run applications. We’re dealing with computer-intensive problems that involve searching many databases.”
Today, many life science organizations apply a kind of manual brute force to the problem of examining diverse data sets, but it’s a solution that soaks up researchers’ time. This approach is not very productive, said Jill Mesirov, chief informatics officer and director of bioinformatics and computational biology at the Broad Institute.
“You can’t let the scientist be the infrastructure,” Mesirov said, describing how researchers often act as the conduit that integrates diverse data. This integrative process today “takes lots of Perl scripting and Web scraping,” she noted.
Mesirov described work the Broad Institute has done on its open-source GenePattern analysis software, which was announced in April. GenePattern is built upon what Mesirov said is a software infrastructure for interoperable informatics. “We [needed] an infrastructure that is grid-enabled, Web services-based, and collaborative,” she said.
Others concurred. “What’s needed is a scalable infrastructure that is open-source and allows [researchers] to pull together many different data sources,” said Andrea Califano, a professor at Columbia University. He noted that today, “bioinformaticians spend most of their time writing data parsers.”
App Development the Key
Another theme sounded throughout the conference was that software development remains a major industry effort. “Life science companies are software companies,” said Tim Mattson, senior research scientist in Intel’s Computational Software Laboratory.
He cited an IDC survey from a few years ago, which found that only 46 percent of the software used in life science companies came from commercial software vendors. Most of the software in use was taken from the public domain or had been developed in-house or by contractors.
In general, this pattern has not changed. “The life science industry is still developing a lot of software,” Mattson said.
One persistent driver of internal software development is commercial software’s inability to keep pace with changing science. “Shrink-wrapped informatics is for amateurs,” said Scott Sneddon, senior fellow at Genzyme. “[We] don’t have the luxury to wait six months for a vendor to release a new product. And you don’t want to do science that is directed by software [availability].”
Another factor is that often no commercial option exists. Peterson noted that many tools are used in the business intelligence world, “but not many are applicable to life science.”
Added Sneddon: “Most operations require the sharing of data between scientific disciplines, and this [can only be done using] custom data analysis, query, and visualization tools.”
Yet the desire for software that encompasses the latest scientific insight and retains flexibility is sometimes at odds with the highly regulated environment in which pharma and biotech work. This fact was duly noted by Michael Rosenberg, associate director of bioinformatics at Biogen Idec.
“We operate in a [Good Laboratory Practice-] and CFR-compliant environment,” Rosenberg said. “We felt we had to move away from custom [Visual Basic] applications to industry-compliant tools.” His group still uses a mix of custom VB applications, but he said it also relies on “industry-strength applications” from vendors including Documentum, BEA, and NuGenesis.