Oct. 10, 2007
Each new generation of DNA sequencers, mass spectrometers, microscopes, and other lab equipment produces a richer, more detailed set of data. We’re already well beyond gigabytes (GB): a single next-generation sequencing run can produce terabytes (TB) of data. As a result, any organization running hundreds of routine experiments a month, or trying to handle the output of next-generation sequencing instruments, quickly finds itself with a massive data management problem.
“Experimental data is growing exponentially,” says Rico Magsipoc, chief technology officer at UCLA’s Laboratory of Neuro Imaging (LONI). “With MRIs, as resolution has increased, data sets have grown and can range from 78 GB to 500 GB.” According to Illumina, a fully tasked Genome Analyzer instrument yields more than 100 TB in image and processed data in a year, forcing some groups to discard raw image files to conserve storage.
Beyond lab equipment producing larger output files, more and more of today’s datasets have metadata associated with them. “We’re already seeing some adoption of Web 2.0 type technology in microscopy and microarray imaging where users are tagging, characterizing, and indexing the data,” says James Reaney, director of research markets at BlueArc. This has profound implications for storage system capacity.
For many years, the way to handle data growth was to simply throw raw storage capacity at the problem. But that approach no longer works. Besides dealing with capacity challenges, life sciences organizations must also deal with performance, management, and energy issues when it comes to their storage systems.
Lab data needs to be processed, analyzed, and visualized to be of any value. Typically, this requires the use of high performance computing (HPC) clusters whose nodes must be constantly fed data. Moving the data on and off of storage devices to the cluster nodes becomes the challenge.
This was an issue at LONI. The lab’s researchers and their collaborators conduct research into Alzheimer’s, autism, schizophrenia, and other diseases. The lab scans humans, primates, and rodents, producing 2D MRI slices of the brain. These scans are then concatenated into a 3D model.
“Four years ago, we had a SAN [storage area network] and two front-end servers that fed 20 or so machines,” said Magsipoc. “We then got a 300-plus node cluster and the CPUs would be data-starved.”
This is a common problem in the life sciences. Many organizations moved from high-end mini-computer systems to clusters at the same time the volumes of data that needed to be analyzed were skyrocketing. This situation creates a bottleneck. Many storage systems simply cannot keep data flowing fast enough to keep computations rolling along. Magsipoc evaluated a number of systems, opting for one from Isilon that addresses the performance issue while offering fast provisioning of new storage capacity so that additional storage could easily be added.
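The scale of the mismatch is easy to see with a back-of-envelope calculation. The figures below are illustrative assumptions, not LONI’s actual numbers, but they show why a storage front end sized for 20 machines starves a 300-node cluster:

```python
# Back-of-envelope check: can the storage front end feed the cluster?
# All figures are illustrative assumptions, not LONI's actual numbers.

NODES = 300                  # compute nodes reading concurrently
PER_NODE_MB_S = 50           # assumed sustained read rate one analysis job wants
FRONT_ENDS = 2               # servers exporting the SAN to the cluster
FRONT_END_MB_S = 800         # assumed throughput ceiling per front-end server

demand = NODES * PER_NODE_MB_S            # aggregate demand in MB/s
supply = FRONT_ENDS * FRONT_END_MB_S      # aggregate supply in MB/s

print(f"demand: {demand} MB/s, supply: {supply} MB/s")
print(f"each node actually gets ~{supply / NODES:.0f} MB/s")
```

Under these assumptions the cluster asks for roughly ten times what the two front-end servers can deliver, leaving each CPU idle most of the time while it waits for data.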
Keeping Pace with New Data Trends
Life sciences organizations typically have masses of data that are never modified after they are initially created. A prime example is the set of files generated by a lab experiment. Certainly, those data are often analyzed and visualized, but the original data are not changed.
When an experiment is run, the data need to be stored on a system with the performance to support whatever analytic or visualization workflows are used to process the information. Decisions must quickly be made as to how to cost-effectively manage such reference or archival data.
“Ideally, researchers want to keep all of the [processed] data on disk,” says Magsipoc. But this is not practical given the volumes of data many organizations are dealing with today. The issue is not unique to the life sciences, of course. Many industries such as financial services, oil and gas, and manufacturing must deal with data growth, archival storage issues, and matching storage performance to computational requirements.
In fact, most companies are finding they simply need to keep data for longer periods of time. Some industries do so to meet regulatory requirements for data retention. In the life sciences, data is often kept to support intellectual property claims or new drug application submissions. According to the consultancy The Enterprise Strategy Group, the cumulative amount of archived data in corporations worldwide will grow from about 3,000 petabytes (PB) in 2006 to 27,000 PB in 2010, a nine-fold increase.
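The arithmetic behind that projection is worth making explicit, since a "nine-fold" jump understates how steep the implied annual growth is:

```python
# Implied compound annual growth rate for the archived-data projection
# cited above: 3,000 PB in 2006 growing to 27,000 PB in 2010.

start_pb, end_pb, years = 3_000, 27_000, 4

growth_factor = end_pb / start_pb            # 9.0, i.e. nine-fold overall
cagr = growth_factor ** (1 / years) - 1      # annualized over the 4-year span

print(f"{growth_factor:.0f}x overall, ~{cagr:.0%} per year")
```

A nine-fold increase over four years works out to roughly 73 percent compound growth per year, meaning archive capacity must nearly double annually just to keep up.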
The Impact on Storage Systems Is Significant
In the past, most archival data would be moved off of online storage systems and retired to tape and eventually deleted. But today, a large portion of data must remain available online. Additionally, many applications (particularly those that use Web 2.0 and Semantic Web approaches) are designed so the data are available all the time. That means the data are not likely to be taken off primary storage and archived to tape as other data are.
The combination of these factors means life sciences organizations must manage large volumes of data and have the ability to easily add more capacity as demand requires. One way to address the situation is to store data more efficiently. That’s an approach being espoused by Vertica, a database start-up co-founded by Andy Palmer, the former CIO and senior vice president at Infinity Pharmaceuticals (see “Pack It In”).
For many organizations, the capacity planning aspect is a challenge. “When discovery work on a project would begin, we would get space requirements from the scientists,” said Peter Herrin, senior systems analyst at Infinity Pharmaceuticals. “They would tell us they would be generating between 20 and 200 GB per month.” That’s quite a range in capacity.
Infinity Pharmaceuticals selected a storage solution from 3PAR that features a thin provisioning capability. “This allows me to allocate space on a host without using physical storage on the backside,” says Herrin. “If [the scientists] only use 20 GB, I haven’t shorted myself on physical storage that I can use for other projects.”
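The idea behind thin provisioning can be sketched in a few lines: hosts are promised (logically allocated) more space than physically exists, and real disk is consumed only as data is actually written. This toy model is illustrative only, not 3PAR’s implementation:

```python
# Toy model of thin provisioning. Hosts are promised logical space up front;
# physical blocks are drawn from a shared pool only when data is written.
# Illustrative sketch only -- not 3PAR's actual implementation.

class ThinPool:
    def __init__(self, physical_gb):
        self.physical_gb = physical_gb   # real disk behind the pool
        self.promised_gb = 0             # sum of logical volume sizes
        self.used_gb = 0                 # space actually written

    def provision(self, logical_gb):
        # Provisioning succeeds even when promises exceed physical capacity
        # (over-subscription); only writes consume real space.
        self.promised_gb += logical_gb

    def write(self, gb):
        if self.used_gb + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: add physical capacity")
        self.used_gb += gb

pool = ThinPool(physical_gb=500)
pool.provision(200)   # one project asks for up to 200 GB
pool.provision(200)   # a second project asks for the same
pool.write(20)        # so far only 20 GB has actually been written

print(f"promised {pool.promised_gb} GB, "
      f"using {pool.used_gb} of {pool.physical_gb} GB physical")
```

The trade-off is that the administrator must monitor actual consumption: if writes ever approach the pool’s real capacity, physical disks have to be added before the pool is exhausted.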
High Performance with Easier Management
Thin provisioning is just one aspect of easier management. Now, interest is growing in another technology that promises to deliver higher performance storage with simpler management.
For years, storage area networks (SANs) were used only in large enterprises and only for applications that required very high performance. The reason for the exclusivity: SANs were typically much more expensive to buy and, because they used Fibre Channel technology, more expensive to manage as well, since few IT staff had experience with Fibre Channel and needed special training.
So the performance SANs deliver is highly desirable. However, many organizations do not want to take on the complexity of using the technology.
The situation is changing. There is now growing interest in IP SANs, which carry SCSI (Small Computer System Interface) commands over standard IP networks. The advantage of IP SANs (also called iSCSI-based SANs) is that they use standard Ethernet infrastructure and low-cost interconnection devices, whereas traditional SANs rely on the more complex and expensive Fibre Channel technology.
With IP SANs, the equipment, infrastructure, and management costs are not as high as a traditional SAN. What is especially appealing about the technology today is the introduction of newer products that make use of 10 Gigabit Ethernet (10 GbE) to connect the storage systems. In many cases, this connection speed overcomes bottlenecks that would have required Fibre Channel in the past.
Last spring, Douglas Gibbs, pathology bioinformatics manager at the University of Michigan in Ann Arbor, tested one of these new products from Intransa for use with a medical imaging application. “The 10-GbE infrastructure is what allows the [system] to overcome what would otherwise be front-end ‘pipe’ barriers to achieving maximum performance,” said Gibbs.
With new products like this, the research firm IDC believes the market for iSCSI is poised for rapid growth. In fact, IDC predicts revenues for these products will have a 73 percent compound annual growth rate from 2006 to 2010.
According to the industry trade group the Storage Networking Industry Association (SNIA), “the factors driving this growth have been the continuing need for IT organizations to do more with less — less capital cost, fewer administrators per Terabyte, less complexity.”
Addressing Energy Efficiency
Costs of another type are now becoming an issue: the electricity needed to run and cool storage systems.
Like most IT and data center equipment, storage devices continue to increase in performance while physically shrinking. While the combination of higher performance and higher densities helps meet the capacity and computational requirements for life sciences research, it also increases the demand for electricity. More power is needed to run the systems and cool the densely packed (and hotter) units.
The emphasis on storage system energy efficiency was made clear in July when EMC announced annual updates to its entire product line. Normally, these announcements focus on improved performance since all of the systems take advantage of the most recent processors and disk drives. This year, performance was certainly mentioned, but so too was energy efficiency.
EMC claimed new Symmetrix, Celerra, and CLARiiON systems reduce power consumption by 33 percent. And a new version of the EMC Centera reduces power and cooling requirements by 67 percent per terabyte.
Many other storage vendors are also targeting energy efficiency. In particular, IBM, HP, EMC, Network Appliance, SGI, Quantum, and Netezza are all members of The Green Grid, a consortium of information technology companies and professionals seeking to lower the overall consumption of power in data centers around the globe.
As data volumes grow, the increased attention to energy efficiency is going hand in hand with simplified management and high performance as key criteria for storage systems to handle life sciences data.