By Anthony Strattner
It's no secret that the lack of standards for data exchange and data representation is a serious bottleneck for life science researchers. Increasingly, they must deal with large volumes of data and diverse data types from different sources, all of which must be painstakingly pieced together and analyzed before they can do the science. Now, relief may be near. The Interoperable Informatics Infrastructure Consortium (I3C) plans to demonstrate its reference model of life science-specific protocols at the June BIO2002 conference in Toronto.
Don't start celebrating yet. Consortia don't have great track records in these matters: similar efforts by the Common Open Software Environment (COSE) and the High Performance Fortran consortium never fulfilled their promise. Nevertheless, I3C seems to be avoiding past mistakes and gathering momentum. Formed in January 2001, I3C's ranks have since swelled to more than 75 participants. Also in its favor, its organizers say they will focus on realistic goals.
"We're not trying to solve world hunger," says Sia Zadeh, group manager for life sciences at Sun Microsystems Inc. and a founding I3C organizer. "We're keeping the objectives feasible and achievable." That doesn't mean they aren't ambitious. The I3C lists its key objectives on its Web site (www.i3c.org):
- To facilitate open development of standards, protocols, administration and the technical infrastructure for the life science industry.
- To establish a common communications standard protocol that is extensible and can be delivered to the community in a timely fashion.
- To provide forums for discussion of issues that affect technology evolution, development and use — technical, commercial, etc.
The quick growth in I3C ranks suggests the life science community is eager to adopt a standard set of procedures and protocols for accessing, integrating and sharing data to facilitate drug discovery and development. I3C has recruited a diverse coalition. Participants range from IT stalwarts such as Compaq Computer Corp., IBM, Oracle Corp., and Sun, to life science and informatics heavy-hitters such as LION Bioscience AG, Accelrys Inc., and Affymetrix Inc. There's a smattering of academic and government organizations including the Whitehead Institute Center for Genome Research, Los Alamos National Laboratory and the National Cancer Institute.
Conspicuously absent, though, are the big pharmaceuticals, such as GlaxoSmithKline, Johnson & Johnson and Merck & Co.
Eric Neumann, vice president of informatics at Beyond Genomics and I3C organizer, says I3C will become more attractive to drug makers when its standards work encompasses not only data-analysis technology, but also scientific discovery and clinical studies. "I want to show them that things can move faster than with other standards bodies," says Neumann. "I want to make sure the science needs are addressed effectively by I3C."
Following almost a year of development work, I3C formally incorporated in February and elected board officers. Tim Clark, vice president of informatics at Millennium Pharmaceuticals, was named chairman of the board. Jill Mesirov, CIO and director of bioinformatics and computational biology at Whitehead Institute Center for Genome Research, was elected vice chair.
Other board members include Morrie Ruffin, vice president of business development at Biotechnology Industry Organization (BIO); Jeff Augen, director of business strategy for life sciences at IBM; and Sun's Zadeh. The board is seeking an executive director and reviewing several regional proposals for the site of I3C headquarters.
"The work on standards development has been ongoing," says Zadeh, "and we're still planning to present our draft specs at the June BIO conference. I expect to see I3C-compliant products offered in less than a year." Organizations that have been working with I3C on a voluntary basis must now apply for membership.
Thus far, I3C has leaned heavily on work done by the World Wide Web Consortium (W3C) to develop common protocols for the Web. In the Internet era, proprietary protocols and low-level APIs (application programming interfaces) for network operating systems are rapidly ceding control of corporate networks to Web services. A Web service is simply an application delivered as a service that can be integrated with other Web services using Internet standards. Technically, it is a URL-addressable resource that programmatically returns information to clients who want to use it.
Speaking the Same Language
No accident, then, that I3C's first demonstration of interoperability among heterogeneous files, applications and systems at the BIO2001 conference in June relied heavily on the Web's standard meta language for describing data, XML (extensible markup language).
Using a unified set of XML protocols, the group cobbled together 10 different products that enabled the exchange and analysis of sequence data. The demonstration was limited in scope — the workflow of a typical molecular biologist who identifies a collection of sequences and performs a series of analyses to further explore the data — but it worked. That architectural model has since become a reference platform for I3C's first Working Group in Genomics, charged with finding a practical solution for integrating genomics data.
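To make the idea of exchanging sequence data through a shared XML format concrete, here is a minimal sketch in Python. It is not the I3C protocol: the element and attribute names (`sequence`, `residues`, `id`, `source`) and the sample record are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def sequence_to_xml(seq_id: str, residues: str, source: str) -> str:
    """Serialize one sequence record to XML (hypothetical element names)."""
    record = ET.Element("sequence", attrib={"id": seq_id, "source": source})
    ET.SubElement(record, "residues").text = residues
    return ET.tostring(record, encoding="unicode")


def sequence_from_xml(document: str) -> dict:
    """Parse a record written by sequence_to_xml back into a dictionary."""
    record = ET.fromstring(document)
    return {
        "id": record.get("id"),
        "source": record.get("source"),
        "residues": record.findtext("residues"),
    }


# One tool writes the record; another reads it without sharing any code,
# only the agreed XML layout.
message = sequence_to_xml("seq-001", "ATGGCGTACGTT", "tool-a")
received = sequence_from_xml(message)
```

The point of the sketch is that the two functions could live in different products from different vendors; the only thing they have to agree on is the document layout.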

Sidebar: I3C's Intellectual Property Agreement
Technical-standards consortia are generally formed, at least in part, to create or modify technology. Once this has been accomplished, the sensitive question of who owns the technology arises.
"Any time you try to define a framework for open applications and data integration, the participating companies may need or want to contribute technology that they have patented or have other rights to," says Sia Zadeh, group manager for life sciences at Sun and a founding I3C organizer. "Once a member has contributed its technology, you cannot leave the question of ownership unsettled."
Not wanting to reinvent the wheel or repeat past mistakes, Zadeh and his fellow founders looked for a precedent. They discovered the World Wide Web Consortium (W3C) settled the issue by drafting into its bylaws and membership agreement a waiver to all rights to intellectual property created or used in consortium-led solutions. I3C's founding firms agreed to draft a similar waiver.
"The consensus was for a royalty-free approach," says Zadeh, "like open source software." When a programmer, either as a private individual or as the member of a larger organization, works on a piece of open source software that's then made available to the public, that person releases all ownership claims to the software. Ultimately, if Sun, IBM or another I3C member contributes technology as part of specifications subsequently ratified by the organization and adopted by other member companies, that technology has no charges or fees attached.

Unlike historical approaches to data integration that focused on object model-specific protocols such as DCOM (distributed component object model), RMI (remote method invocation), or IIOP (Internet inter-ORB protocol), the genomics working group is working at the higher, "meta data" level. XML is a logical choice, because the W3C created it specifically to be a robust language for representing commands and typed data. Coupled with SOAP (simple object-access protocol), a message-based protocol for accessing services on the Web, data files may be encoded and transported over HTTP-supported networks.
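The pairing of XML and SOAP can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the namespace is the standard SOAP 1.1 envelope namespace, but the `getSequence` request and its `id` field are hypothetical and not taken from any I3C specification.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace; the "soap" prefix is just a readable label.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)


def wrap_in_soap(payload: ET.Element) -> bytes:
    """Place an XML payload inside a SOAP Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def unwrap_soap(message: bytes) -> ET.Element:
    """Return the first element found inside the SOAP Body."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    return body[0]


# Hypothetical request: ask a remote service for one sequence by identifier.
request = ET.Element("getSequence")
ET.SubElement(request, "id").text = "seq-001"

wire_bytes = wrap_in_soap(request)   # what would be sent as an HTTP POST body
unwrapped = unwrap_soap(wire_bytes)  # what the receiving service would see
```

Because the envelope is ordinary XML carried over ordinary HTTP, neither side needs to know which object model or programming language the other uses.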
The genomics working group is also looking at protocols for accessing and querying disparate databases of biological data. Two such protocols under review are Enterprise JavaBeans (EJB) and the distributed annotation system (DAS). Because Java objects can be dynamically created from XML documents, EJB-based middleware could let any Java client access biological databases that represent their proprietary data schemas as XML schemas. With help from HTTP and XML technology, DAS is designed to provide a common protocol for gathering genome-annotation data from different sources, collating that information, and displaying it in a single view.
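The gather, collate and display pattern attributed to DAS can be illustrated with a short Python sketch. The XML layout below is a simplified, hypothetical stand-in and not the actual DAS format, and the two inline documents use made-up sample data in place of responses fetched over HTTP from separate annotation servers.

```python
import xml.etree.ElementTree as ET

# Made-up annotation documents from two hypothetical sources.
SOURCE_A = """<annotations source="lab-a">
  <feature label="gene-x" start="1200" end="1850"/>
  <feature label="promoter-x" start="1000" end="1199"/>
</annotations>"""

SOURCE_B = """<annotations source="lab-b">
  <feature label="repeat-1" start="1500" end="1600"/>
</annotations>"""


def collate(documents):
    """Merge features from every document and order them by position."""
    features = []
    for document in documents:
        root = ET.fromstring(document)
        for feature in root.iter("feature"):
            features.append({
                "source": root.get("source"),
                "label": feature.get("label"),
                "start": int(feature.get("start")),
                "end": int(feature.get("end")),
            })
    return sorted(features, key=lambda feat: (feat["start"], feat["end"]))


def render(features):
    """Produce a single plain-text view, one feature per line."""
    return "\n".join(
        f'{feat["start"]:>6}-{feat["end"]:<6} {feat["label"]} ({feat["source"]})'
        for feat in features
    )


merged = collate([SOURCE_A, SOURCE_B])
view = render(merged)
```

Each source keeps control of its own annotations; the client does the merging, which is what lets data from different groups appear side by side in one view.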
The genomics working group plans to have a "solutions" draft ready by June to present at the BIO2002 conference. Rather than simply a set of technical specifications, this solution will combine a scientific "use-case" representation — a task list of particular data and analysis requirements — and a fully documented and open technical implementation.
"This is a very different approach from getting a bunch of engineers together to design a new programming model," says Neumann. "It's working more the way scientists do in life sciences companies; we're getting together to show that the solution works, and to figure out ways to do it better next time."
Meanwhile, there will be working groups in proteomics, pathways, and ontologies, developing their own solutions frameworks, which are not expected to be ready before the end of 2002. The proteomics working group is expected to build on the work done by the genomics group; the working groups for pathways and ontology will produce recommendations for how biological data is referenced by translation to a common naming scheme.
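Translation to a common naming scheme can be pictured as a lookup from local names to shared identifiers. The Python sketch below is an assumption-laden illustration: the `EXAMPLE:` identifiers and the lookup table are invented here, and the working groups had not published any such scheme at the time of writing.

```python
# Hypothetical lookup table: local names on the left, shared identifiers on
# the right. Real schemes would be far larger and centrally maintained.
CANONICAL_NAMES = {
    "tp53": "EXAMPLE:0001",
    "p53": "EXAMPLE:0001",
    "brca1": "EXAMPLE:0002",
}


def to_canonical(name: str):
    """Return the shared identifier for a local name, or None if unknown."""
    return CANONICAL_NAMES.get(name.strip().lower())


def translate_records(records):
    """Attach a canonical identifier to each record; collect unmapped names."""
    translated, unmapped = [], []
    for record in records:
        canonical = to_canonical(record["name"])
        if canonical is None:
            unmapped.append(record["name"])
        else:
            translated.append({**record, "canonical_id": canonical})
    return translated, unmapped


# Two databases refer to the same gene under different local names.
records = [
    {"name": "TP53", "database": "db-one"},
    {"name": "p53 ", "database": "db-two"},
    {"name": "unknown-gene", "database": "db-two"},
]
translated, unmapped = translate_records(records)
```

Once both records carry the same identifier they can be joined or compared, while names that fail to map are surfaced for a curator instead of being silently dropped.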
The work of the genomics group will be central to realizing I3C's ultimate goal: the creation of XML forms that not only define microarray data and proteomics data, for example, but also enable the merging and analyzing of that data. This language will act as a bridge between different experiments. Such a language will be able to bundle all the information involved, from experiment design and execution, to data acquisition and analysis.
A framework for this "data fusion" is at least a year away, but many think I3C's approach augurs well for success: rather than standardizing schemas directly, it advocates higher-level data representations from which schemas can then be mapped.
Anthony Strattner is a writer based in Framingham, Mass. He can be reached at firstname.lastname@example.org.