Data Management: The Next Generation


By Salvatore Salamone
Oct. 10, 2007 | Each new generation of DNA sequencers, mass spectrometers, microscopes, and other lab equipment produces a richer, more detailed set of data. We’re already way beyond gigabytes (GB): one next-generation sequencing experiment can produce terabytes (TB) of data in a single run. As a result, any organization running hundreds of routine experiments a month or year, or trying to handle the output of next-generation sequencing instruments, quickly finds itself with a massive data management problem.

“Experimental data is growing exponentially,” says Rico Magsipoc, chief technology officer at UCLA’s Laboratory of Neuro Imaging (LONI). “With MRIs, as resolution has increased, data sets have grown in size and can be from 78 to 500 GB.” According to Illumina, a fully tasked Genome Analyzer instrument yields more than 100 TB in image and processed data in a year, forcing some groups to discard raw image files to conserve storage.

Beyond the larger output files that lab equipment produces, more and more of today’s datasets carry associated metadata. “We’re already seeing some adoption of Web 2.0 type technology in microscopy and microarray imaging where users are tagging, characterizing, and indexing the data,” says James Reaney, director of research markets at BlueArc. This has profound implications for storage system capacity.

For many years, the way to handle data growth was to simply throw raw storage capacity at the problem. But that approach no longer works. Besides dealing with capacity challenges, life sciences organizations must also deal with performance, management, and energy issues when it comes to their storage systems.

Performance Matters
Lab data needs to be processed, analyzed, and visualized to be of any value. Typically, this requires the use of high performance computing (HPC) clusters whose nodes must be constantly fed data. Moving the data on and off of storage devices to the cluster nodes becomes the challenge.

This was an issue at LONI. The lab’s researchers and their collaborators conduct research into Alzheimer’s, autism, schizophrenia, and other diseases. The lab scans humans, primates, and rodents, producing 2D MRI slices of the brain. These scans are then concatenated into a 3D model.

“Four years ago, we had a SAN [storage area network] and two front-end servers that fed 20 or so machines,” said Magsipoc. “We then got a 300-plus node cluster and the CPUs would be data starved.”

This is a common problem in the life sciences. Many organizations moved from high-end mini-computer systems to clusters at the same time the volumes of data that needed to be analyzed were skyrocketing. This situation creates a bottleneck: many storage systems simply cannot keep data flowing fast enough to keep computations rolling along. Magsipoc evaluated a number of systems, opting for one from Isilon that addresses the performance issue and allows additional storage capacity to be provisioned quickly.

Keeping Pace with New Data Trends
Life sciences organizations typically have masses of data that are never modified after they are initially created. A prime example of these data are the files generated by a lab experiment. Certainly, those data are often analyzed and visualized, but the original data are not changed.

When an experiment is run, the data need to be stored on a system that has the appropriate performance capabilities to support whatever analytic or visualization workflows are used to process the information. Decisions must then be made quickly as to how to cost-effectively manage such reference or archival data.

“Ideally, researchers want to keep all of the [processed] data on disk,” says Magsipoc. But this is not practical given the volumes of data many organizations are dealing with today. The issue is not unique to the life sciences, of course. Many industries such as financial services, oil and gas, and manufacturing must deal with data growth, archival storage issues, and matching storage performance to computational requirements.

In fact, most companies are finding they simply need to keep data for longer periods of time. Some industries do so to meet regulatory requirements for data retention. In the life sciences, data are often kept to support intellectual property claims or new drug application submissions. According to the consultancy The Enterprise Strategy Group, the cumulative amount of archived data in corporations worldwide will grow from about 3,000 petabytes (PB) in 2006 to 27,000 PB in 2010, a nine-fold increase.
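To put the forecast in yearly terms, the nine-fold increase can be converted into an implied annual growth rate. The short Python sketch below does that arithmetic. It assumes smooth compound growth between 2006 and 2010, which the forecast itself does not state, and the function and variable names are illustrative only.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that takes `start` to `end`.

    Assumes smooth compounding, which is a simplification of any real forecast.
    """
    return (end / start) ** (1 / years) - 1


archived_2006_pb = 3_000   # petabytes, from the forecast cited above
archived_2010_pb = 27_000  # petabytes, from the forecast cited above
years = 2010 - 2006

growth_multiple = archived_2010_pb / archived_2006_pb
annual_rate = implied_annual_growth(archived_2006_pb, archived_2010_pb, years)

print(f"{growth_multiple:.0f}x over {years} years, about {annual_rate:.0%} per year")
```

Under that assumption, archived data would be growing by roughly three-quarters every year.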

The Impact on Storage Systems Is Significant
In the past, most archival data would be moved off of online storage systems and retired to tape and eventually deleted. But today, a large portion of data must remain available online. Additionally, many applications (particularly those that use Web 2.0 and Semantic Web approaches) are designed so the data are available all the time. That means the data are not likely to be taken off primary storage and archived to tape as other data are.

The combination of these factors means life sciences organizations must manage large volumes of data and have the ability to easily add more capacity as demand requires. One way to address the situation is to store data more efficiently. That’s an approach being espoused by Vertica, a database start-up co-founded by Andy Palmer, the former CIO and senior vice president at Infinity Pharmaceuticals. (See “Pack It In.”)

For many organizations, the capacity planning aspect is a challenge. “When discovery work on a project would begin, we would get space requirements from the scientists,” said Peter Herrin, senior systems analyst at Infinity Pharmaceuticals. “They would tell us they would be generating between 20 and 200 GB per month.” That’s quite a range in capacity.

Infinity Pharmaceuticals selected a storage solution from 3PAR that features a thin provisioning capability. “This allows me to allocate space on a host without using physical storage on the backside,” says Herrin. “If [the scientists] only use 20 GB, I haven’t shorted myself on physical storage that I can use for other projects.”
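The idea Herrin describes can be illustrated with a toy model in Python. This is a minimal sketch of the general thin provisioning concept, not 3PAR’s implementation: the `ThinPool` class, its methods, and the 300 GB pool size are all hypothetical. The 20 GB and 200 GB figures come from the scientists’ estimate quoted above.

```python
class ThinPool:
    """Toy model of a thin-provisioned storage pool (all sizes in GB)."""

    def __init__(self, physical_gb: float) -> None:
        self.physical_gb = physical_gb
        self.allocated = {}  # volume name -> size promised to the host
        self.used = {}       # volume name -> physical space actually written

    def allocate(self, name: str, size_gb: float) -> None:
        # Allocation only records a promise; no physical space is consumed yet.
        self.allocated[name] = size_gb
        self.used[name] = 0.0

    def write(self, name: str, gb: float) -> None:
        # Physical space is consumed only when data is actually written.
        if self.used[name] + gb > self.allocated[name]:
            raise ValueError(f"{name}: write exceeds allocated size")
        if self.physical_used() + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical capacity")
        self.used[name] += gb

    def physical_used(self) -> float:
        return sum(self.used.values())

    def physical_free(self) -> float:
        return self.physical_gb - self.physical_used()


pool = ThinPool(physical_gb=300)   # hypothetical pool size
pool.allocate("project_a", 200)    # the scientists' upper estimate
pool.allocate("project_b", 200)    # promised total (400 GB) exceeds physical (300 GB)
pool.write("project_a", 20)        # project A only generates 20 GB

print(pool.physical_free())        # 280.0 GB still available to any project
```

The trade-off is that promised capacity can exceed what is physically installed, so the pool has to be monitored and expanded before the volumes actually fill up.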

High Performance with Easier Management
Thin provisioning is just one aspect of easier management. Now, interest is growing in another technology that promises to deliver higher performance storage with simpler management.

For years, storage area networks (SANs) were used only in large enterprises and only for applications that required very high performance. The reason for the exclusivity: SANs were typically much more expensive to buy and, because they used Fibre Channel technology, they were often more expensive to manage since many IT people did not have experience with the technology and thus needed special training.

The performance SANs deliver is highly desirable, but many organizations do not want to take on the complexity of the technology.

The situation is changing. There is now growing interest in IP SANs, which are SANs that carry SCSI (Small Computer System Interface) traffic over IP networks. The advantage of IP SANs (also called iSCSI-based SANs) is that they use standard Ethernet infrastructure and low-cost interconnection devices, as opposed to traditional SANs, which rely on the more complex and expensive Fibre Channel technology.

With IP SANs, the equipment, infrastructure, and management costs are not as high as those of a traditional SAN. What is especially appealing about the technology today is the introduction of newer products that make use of 10 Gigabit Ethernet (10 GbE) to connect the storage systems. In many cases, this connection speed overcomes bottlenecks that would have required Fibre Channel in the past.
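A rough calculation shows why the faster link matters. The Python sketch below estimates the best-case time to move a 500 GB dataset, the upper end of the MRI sizes Magsipoc cites, over 1 GbE and 10 GbE. It assumes decimal units and the full line rate, and it ignores protocol overhead and disk speed, so real transfers would take longer. The function name is illustrative.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time for `size_gb` decimal gigabytes over a link of
    `link_gbps` gigabits per second, ignoring all overhead."""
    size_gigabits = size_gb * 8  # 8 bits per byte
    return size_gigabits / link_gbps


dataset_gb = 500  # upper end of the MRI dataset sizes quoted earlier

for link_gbps in (1, 10):
    minutes = transfer_seconds(dataset_gb, link_gbps) / 60
    print(f"{link_gbps:>2} GbE: {minutes:.1f} minutes")
```

Even in this idealized case, the 1 GbE link needs more than an hour for a single dataset, while 10 GbE needs under seven minutes.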

Last spring, Douglas Gibbs, pathology bioinformatics manager at the University of Michigan in Ann Arbor, tested one of these new products from Intransa for use with a medical imaging application. “The 10-GbE infrastructure is what allows the [system] to overcome what would otherwise be front-end ‘pipe’ barriers to achieving maximum performance,” said Gibbs.

With new products like this, the research firm IDC believes the market for iSCSI is poised for rapid growth. In fact, IDC predicts revenues for these products will have a 73 percent compound annual growth rate from 2006 to 2010.

According to the industry trade group the Storage Networking Industry Association (SNIA), “the factors driving this growth have been the continuing need for IT organizations to do more with less — less capital cost, fewer administrators per Terabyte, less complexity.”

Addressing Energy Efficiency
Costs of another type are now starting to become an issue. It takes electricity to run and cool storage systems.

Like most IT and data center equipment, storage devices continue to increase in performance while physically shrinking. While the combination of higher performance and higher densities helps meet the capacity and computational requirements for life sciences research, it also increases the demand for electricity. More power is needed to run the systems and cool the densely packed (and hotter) units.

The emphasis on storage system energy efficiency was made clear in July when EMC announced annual updates to its entire product line. Normally, these announcements focus on improved performance since all of the systems take advantage of the most recent processors and disk drives. This year, performance was certainly mentioned, but so too was energy efficiency.

EMC claimed new Symmetrix, Celerra, and CLARiiON systems reduce power consumption by 33 percent. And a new version of the EMC Centera reduces power and cooling requirements by 67 percent per terabyte.

Many other storage vendors are also targeting energy efficiency. In particular, IBM, HP, EMC, Network Appliance, SGI, Quantum, and Netezza are all members of The Green Grid, a consortium of information technology companies and professionals seeking to lower the overall consumption of power in data centers around the globe.

As data volumes grow, the increased attention to energy efficiency is going hand in hand with simplified management and high performance as key criteria for storage systems to handle life sciences data.
