In three years, Celera's networked data has climbed to 100 terabytes—10 times the entire printed collection of the U.S. Library of Congress. To keep data utilization at 80 percent or better, Celera is installing a global SAN.

By Anthony Strattner


April 7, 2002 | When John Reynders joined Celera Genomics Group as vice president of information systems in the spring of 2001, he was startled to learn that Celera was managing more than 100 terabytes of networked storage. Reynders had just spent eight years at Los Alamos National Laboratory working with some of the largest supercomputers in the world, yet Celera's supercomputers were managing nearly twice as much data as those at Los Alamos.

"We'd have a time-stepping application at Los Alamos where we'd crunch along for a while and what we'd save

 Celera's overhaul of its IT infrastructure will be completed by July, says John Reynders, VP of information systems.
at the end was a snapshot of the calculations," Reynders says of the Los Alamos practice of storing only the latest results. "Here, it's quite inverted." Celera must store — and quickly obtain access to — all the trace data, annotations, literature, and databases that support the company's R&D efforts. Researchers need all of these data points to make the best sense of Celera's genomic data.

Now, as Celera expands beyond sequencing the human genome and selling data to life science researchers and moves into drug discovery, the company has had to rethink its IT infrastructure needs. Reynders says he expects Celera's data storage and computation needs to vastly increase and plans to have a wholly new IT architecture implemented by July, one he hopes will be sufficient for the next two to three years. The architecture will blend cutting-edge systems with well-established ones, reflecting Celera's shift to therapeutics and general bioinformatics solutions.

"We still need fast systems," Reynders says, "but also ones that deliver high throughput reliably." To meet these requirements, Celera chose Compaq Computer Corp.'s most advanced storage system, the Enterprise Virtual Array.

Unveiled in the fall of 2001, the Enterprise Virtual Array can be scaled up to 17 terabytes (TB) using a combination of 36GB and 72GB drives. Equally important to Celera, the system can move all that data very quickly. When tightly integrated with a high-speed storage-area network (SAN), the Enterprise Virtual Array achieves data utilization rates of 80 percent or better, compared with 60 percent to 70 percent for ordinary SANs, according to Celera.

"Life sciences companies could capture huge amounts of data and then run modeling simulations against that data," says

Unscrambling Storage: DAS, NAS, and SAN 
A look at how these technologies differ.

Read More 
  
Roger Archibald, vice president and general manager for Compaq's Enterprise Storage Array Division. The EVA's "virtualization" capability would let a company make data copies instantly — Compaq calls them "v-snaps" — without using additional storage space. "You could run two or three sets of analysis in parallel by making point-in-time copies of the data," Archibald says.

Think of virtualization as the creation of a custom index pointing directly to a set of data's physical location; hence a "copy" of the data is actually this index — not a physical copy of the data. Put another way, virtualization separates logical data access from physical data access, allowing users to create what are called storage pools from the physical storage. Virtual disks are created from these pools and are allocated to servers on the network when needed by an application.
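To make the pointer idea concrete, here is a minimal Python sketch; all of the names (StoragePool, VirtualDisk, snapshot) are illustrative, not Compaq or EVA APIs, and real arrays do this in firmware at the block level. A virtual disk is just an index from logical blocks to physical extents, and a "v-snap"-style snapshot copies only that index.

```python
# Minimal sketch of storage virtualization and pointer-based snapshots.
# Illustrative names only; not the EVA's actual interfaces.

class StoragePool:
    """Physical storage: a flat list of fixed-size extents spread over many spindles."""
    def __init__(self, extent_count):
        self.extents = [None] * extent_count   # None = free extent
        self.free = list(range(extent_count))

    def allocate(self, data):
        idx = self.free.pop(0)                 # grab any free extent
        self.extents[idx] = data
        return idx                             # caller stores only the pointer


class VirtualDisk:
    """Logical disk: an index (logical block -> physical extent), not the data itself."""
    def __init__(self, pool):
        self.pool = pool
        self.index = {}                        # logical block number -> extent number

    def write(self, block, data):
        # Copy-on-write: a new extent is allocated, so any snapshot still
        # pointing at the old extent continues to see the old data.
        self.index[block] = self.pool.allocate(data)

    def read(self, block):
        return self.pool.extents[self.index[block]]

    def snapshot(self):
        """Point-in-time 'copy': duplicates only the index, not the extents."""
        snap = VirtualDisk(self.pool)
        snap.index = dict(self.index)          # cheap pointer copy, no data moved
        return snap


if __name__ == "__main__":
    pool = StoragePool(extent_count=1024)
    vdisk = VirtualDisk(pool)
    vdisk.write(0, b"trace data v1")

    snap = vdisk.snapshot()                    # instant, consumes no extra extents
    vdisk.write(0, b"trace data v2")           # live disk gets a new extent

    print(vdisk.read(0))                       # b'trace data v2'
    print(snap.read(0))                        # b'trace data v1' (older point-in-time copy)
```

Because the snapshot is only an index, several analyses can read "their own" copy of the same underlying extents in parallel, which is the scenario Archibald describes.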

One of the main advantages to virtualization is improved performance. According to Compaq, virtualization delivers better performance by load-balancing through the elimination of hot spindles, or hot drives. A "hot" spindle is a drive that is being subjected to intense input-output loads because the data it contains is frequently accessed. Virtualization allows data to be spread across more spindles, or disks, which improves performance.

For example, a traditional RAID storage system designed to handle 36GB of data space requires two 36GB disks to be assigned as RAID disks. So when the data are read or written, there are only two disks involved, which limits performance. With virtualization, this same 36GB of data can be spread across eight or more disks. This improves performance because more spindles are simultaneously involved in larger transfers.
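As a rough illustration of why more spindles help, consecutive logical blocks can be placed round-robin across however many drives back the virtual disk. The per-drive throughput figure below is an assumption for the sake of the arithmetic, not an EVA specification.

```python
# Illustrative only: round-robin striping of logical blocks across spindles,
# which is how spreading 36GB over eight disks instead of two avoids "hot" drives.

PER_SPINDLE_MB_S = 30          # assumed sustained rate of a single drive

def spindle_for_block(block, spindle_count):
    """Round-robin placement: consecutive blocks land on different spindles."""
    return block % spindle_count

def aggregate_throughput(spindle_count):
    """Large sequential transfers scale with the number of spindles involved."""
    return spindle_count * PER_SPINDLE_MB_S

if __name__ == "__main__":
    for spindles in (2, 8):
        placement = [spindle_for_block(b, spindles) for b in range(8)]
        print(f"{spindles} spindles: blocks 0-7 -> {placement}, "
              f"~{aggregate_throughput(spindles)} MB/s aggregate")
```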

"Virtualization technology can save a lot of disk space," says Mike Gulden, a system architect at Celera. "It's also an insurance policy of sorts, because you can always retrieve older point-in-time copies of data." He says that routine tasks such as moving capacity from one application to another take half the time they did with Compaq's EMA 1600, the storage system used by Celera.

System managers can add more fiber-channel drives nondisruptively, too, theoretically letting Celera buy capacity just in time to take advantage of falling costs per megabyte. However, with the company's voracious appetite for storage and its already-high 80 percent utilization rates, Gulden is swayed more by the EVA's smaller footprint and virtualizing capability.


So Far, So Good
Gulden's team is testing the Enterprise Virtual Array, evaluating how it would handle production jobs. One compute-intensive application being put through its paces is the pattern-matching search utility BLAST. Gulden's team is also running I/O-intensive software. So far, the EVA is up to the task, says Gulden.
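The article does not describe the test harness itself, but an I/O-intensive check of this kind can be approximated with a simple sequential-read timing script. The sketch below is hypothetical; the block size and test file are placeholders, not Celera's actual workload.

```python
# Hypothetical I/O-throughput probe, not Celera's test suite.
# Reads an existing large file sequentially and reports sustained MB/s.
import sys
import time

CHUNK = 8 * 1024 * 1024          # 8MB reads, an assumed transfer size

def sequential_read_mb_s(path):
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.time() - start
    return (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")

if __name__ == "__main__":
    # Usage: python read_probe.py /path/to/large/test/file
    print(f"{sequential_read_mb_s(sys.argv[1]):.1f} MB/s sustained")
```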

Once the EVA system is judged fit for regular work, it will replace Celera's existing EMA 1600 storage systems. This storage is used by Celera's supercomputers in Rockville, Md., which deliver about 1.5 teraflops (a teraflop is a trillion floating-point operations per second).

Much of the EVA's storage will be used for Oracle database servers and a variety of Network File System (NFS) servers. "Rather than one or two very I/O-intensive applications that need the EVA system's bandwidth, we have multiple NFS servers that must serve a large computer farm of machines," Gulden explains. "The higher the I/O bandwidth of the fiber-channel controllers, the more NFS data we can pump out to our compute-intensive machines."

[Sidebar: Direct-Attached Storage Still Dominates Biotech. The general trend for managing large databases is toward NAS and SAN; interestingly, large biotechnology and pharmaceutical companies favor SAN when adding storage capacity, according to an industry-specific study by IDC.]

The EVA is optimized with 2Gb/sec. (gigabits per second) of bandwidth per path behind the controller, while its network interfaces are "2Gb ready," meaning they support either 1Gb/sec. or 2Gb/sec. SANs. To take full advantage, however, the entire data path to the server, including switches and host-bus adapters, must handle 2Gb/sec. throughput. Not surprisingly, Celera is building just such a storage-area network — enterprise-wide.
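The end-to-end point reduces to simple arithmetic: a path is only as fast as its slowest hop, so a single 1Gb/sec. host-bus adapter or switch port caps an otherwise 2Gb/sec. path. The sketch below uses assumed link speeds for illustration; roughly 200 MB/s of data corresponds to 2Gb/sec. Fibre Channel.

```python
# Illustrative only: effective SAN throughput is the minimum link speed
# along the path, and the compute farm sees the sum over NFS servers.
# Speeds are assumptions; ~200 MB/s corresponds to 2Gb/sec. Fibre Channel.

def path_throughput_mb_s(hops):
    """A path is only as fast as its slowest hop (controller, switch, HBA)."""
    return min(hops.values())

if __name__ == "__main__":
    all_2gb = {"controller": 200, "switch": 200, "host_bus_adapter": 200}
    mixed   = {"controller": 200, "switch": 200, "host_bus_adapter": 100}  # one 1Gb hop

    print(path_throughput_mb_s(all_2gb))   # 200 MB/s: full 2Gb/sec. end to end
    print(path_throughput_mb_s(mixed))     # 100 MB/s: a single legacy hop halves the path

    # Aggregate NFS bandwidth available to the compute farm scales with the
    # number of servers that each get a full-speed path.
    nfs_servers = 8
    print(nfs_servers * path_throughput_mb_s(all_2gb), "MB/s to the farm")
```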

"Our goal is a single, global SAN where we can consolidate all our data instead of having several 'SAN islands,' or individual pools of storage," Gulden says. A fiber-channel switched fabric, using 2GB/sec. Brocade switches private-labeled by Compaq, will be the likely network infrastructure.

As for the remaining upgrades in Celera's data center — network, switches, etc. — also scheduled to be finished this summer, the Request For Proposal is Compaq's to lose. Celera's IT infrastructure is currently wall-to-wall Compaq — all the storage, the workstations and servers — and while Gulden declined to say whether it will remain so, he did say, "We've been quite happy with [Compaq] so far." * 


Anthony Strattner is a writer based in Framingham, Mass. He can be reached at strattner@rcn.com. 


PHOTO BY CHRIS HARTLOVE




