Data Management: The Next Generation



By Salvatore Salamone
Oct. 10, 2007 | Each new generation of DNA sequencers, mass spectrometers, microscopes, and other lab equipment produces a richer, more detailed set of data. We’re already well beyond gigabytes (GB): a next-generation sequencing experiment can produce terabytes (TB) of data in a single run. As a result, any organization running hundreds of routine experiments a month or year, or trying to handle the output of next-generation sequencing instruments, quickly finds itself with a massive data management problem.

“Experimental data is growing exponentially,” says Rico Magsipoc, chief technology officer at UCLA’s Laboratory of Neuro Imaging (LONI). “With MRIs, as resolution has increased, data sets have grown in size and can be from 78 to 500 GB.” According to Illumina, a fully tasked Genome Analyzer instrument yields more than 100 TB in image and processed data in a year, forcing some groups to discard raw image files to conserve storage.

Lab equipment is producing larger output files, and more and more of today’s datasets also have metadata associated with them. “We’re already seeing some adoption of Web 2.0-type technology in microscopy and microarray imaging where users are tagging, characterizing, and indexing the data,” says James Reaney, director, research markets at BlueArc. This added metadata has profound implications for storage system capacity.

For many years, the way to handle data growth was to simply throw raw storage capacity at the problem. But that approach no longer works. Besides dealing with capacity challenges, life sciences organizations must also deal with performance, management, and energy issues when it comes to their storage systems.

Performance Matters
Lab data needs to be processed, analyzed, and visualized to be of any value. Typically, this requires the use of high performance computing (HPC) clusters whose nodes must be constantly fed data. Moving the data on and off of storage devices to the cluster nodes becomes the challenge.

This was an issue at LONI. The lab’s researchers and their collaborators conduct research into Alzheimer’s, autism, schizophrenia, and other diseases. The lab scans humans, primates, and rodents, producing 2D MRI slices of the brain. These scans are then concatenated into a 3D model.

“Four years ago, we had a SAN [storage area network] and two front-end servers that fed 20 or so machines,” said Magsipoc. “We then got a 300-plus node cluster and the CPUs would be data starved.”

This is a common problem in the life sciences. Many organizations moved from high-end minicomputer systems to clusters at the same time the volumes of data that needed to be analyzed were skyrocketing. This situation creates a bottleneck: many storage systems simply cannot keep data flowing fast enough to keep computations rolling along. Magsipoc evaluated a number of systems and opted for one from Isilon that addresses the performance issue and allows new storage capacity to be provisioned quickly.

Keeping Pace with New Data Trends
Life sciences organizations typically have masses of data that are never modified after they are initially created. A prime example is the set of files generated by a lab experiment. Certainly, those data are often analyzed and visualized, but the original data are not changed.

When an experiment is run, the data need to be stored on a system that has the appropriate performance capabilities to support whatever analytic or visualization workflows are used to process the information. Decisions must then quickly be made as to how to cost-effectively manage such reference or archival data.

“Ideally, researchers want to keep all of the [processed] data on disk,” says Magsipoc. But this is not practical given the volumes of data many organizations are dealing with today. The issue is not unique to the life sciences, of course. Many industries such as financial services, oil and gas, and manufacturing must deal with data growth, archival storage issues, and matching storage performance to computational requirements.

In fact, most companies are finding they simply need to keep data for longer periods of time. Some industries do so to meet regulatory requirements for data retention. In the life sciences, data is often kept to support intellectual property claims or new drug application submissions. According to the consultancy The Enterprise Strategy Group, the cumulative amount of archived data in corporations worldwide will grow from about 3,000 petabytes (PB) in 2006 to 27,000 PB in 2010, a nine-fold increase.
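As a rough check on those figures, the short sketch below works out the annual growth rate implied by going from 3,000 PB to 27,000 PB over four years. It assumes smooth compound growth, which the forecast itself does not state, and the function and variable names are illustrative only.

```python
def compound_annual_growth_rate(start, end, years):
    """Return the constant yearly growth rate that takes start to end."""
    return (end / start) ** (1 / years) - 1


# Figures quoted from the Enterprise Strategy Group forecast.
archived_2006_pb = 3_000
archived_2010_pb = 27_000
years = 2010 - 2006

fold_increase = archived_2010_pb / archived_2006_pb
# Assumption: growth compounds evenly each year.
cagr = compound_annual_growth_rate(archived_2006_pb, archived_2010_pb, years)

print(f"{fold_increase:.0f}-fold over {years} years")
print(f"implied annual growth: {cagr:.1%}")
```

Under that assumption, a nine-fold increase in four years corresponds to archived data growing by roughly 73 percent each year.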

The Impact on Storage Systems Is Significant
In the past, most archival data would be moved off of online storage systems, retired to tape, and eventually deleted. But today, a large portion of data must remain available online. Additionally, many applications (particularly those that use Web 2.0 and Semantic Web approaches) are designed so the data are available all the time. That means the data are not likely to be taken off primary storage and archived to tape as other data are.

The combination of these factors means life sciences organizations must manage large volumes of data and have the ability to easily add more capacity as demand requires. One way to address the situation is to store data more efficiently. That’s an approach being espoused by Vertica, a database start-up co-founded by Andy Palmer, the former CIO and senior vice president at Infinity Pharmaceuticals. (See “Pack It In.”)

For many organizations, the capacity planning aspect is a challenge. “When discovery work on a project would begin, we would get space requirements from the scientists,” said Peter Herrin, senior systems analyst at Infinity Pharmaceuticals. “They would tell us they would be generating between 20 and 200 GB per month.” That’s quite a range in capacity.

Infinity Pharmaceuticals selected a storage solution from 3PAR that features a thin provisioning capability. “This allows me to allocate space on a host without using physical storage on the backside,” says Herrin. “If [the scientists] only use 20 GB, I haven’t shorted myself on physical storage that I can use for other projects.”
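The idea Herrin describes can be illustrated with a toy model. The sketch below is not 3PAR's implementation; the `ThinPool` class, its method names, and the 300 GB pool size are hypothetical, chosen only to show that allocating space to a host consumes no physical storage until data are actually written.

```python
class ThinPool:
    """Toy model of a thin-provisioned storage pool (all sizes in GB)."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        # name -> {"allocated": GB promised to the host, "used": GB written}
        self.volumes = {}

    def create_volume(self, name, allocated_gb):
        # Promising capacity to a host consumes no physical storage yet.
        self.volumes[name] = {"allocated": allocated_gb, "used": 0}

    def write(self, name, gb):
        vol = self.volumes[name]
        if vol["used"] + gb > vol["allocated"]:
            raise ValueError("write exceeds the volume's allocated size")
        if self.physical_used() + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical capacity")
        vol["used"] += gb

    def physical_used(self):
        return sum(v["used"] for v in self.volumes.values())

    def physical_free(self):
        return self.physical_gb - self.physical_used()

    def allocated_total(self):
        return sum(v["allocated"] for v in self.volumes.values())


# Hypothetical scenario: two projects each ask for the 200 GB upper estimate,
# but the pool holds only 300 GB of physical disk.
pool = ThinPool(physical_gb=300)
pool.create_volume("project_a", allocated_gb=200)
pool.create_volume("project_b", allocated_gb=200)
pool.write("project_a", 20)  # the scientists end up using only 20 GB

print("allocated:", pool.allocated_total(), "GB")
print("physically used:", pool.physical_used(), "GB")
print("physically free:", pool.physical_free(), "GB")
```

In this model the pool has promised 400 GB while holding only 300 GB, and 280 GB remain free for other projects. The trade-off is that an over-committed pool can run out of physical capacity if every project grows to its upper estimate, so actual usage has to be monitored.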

High Performance with Easier Management
Thin provisioning is just one aspect of easier management. Now, interest is growing in another technology that promises to deliver higher performance storage with simpler management.

For years, storage area networks (SANs) were used only in large enterprises and only for applications that required very high performance. The reason for the exclusivity: SANs were typically much more expensive to buy and, because they used Fibre Channel technology, they were often more expensive to manage since many IT people did not have experience with the technology and thus needed special training.

The performance SANs deliver is highly desirable, but many organizations do not want to take on the complexity of the technology.

The situation is changing. There is now growing interest in IP SANs, which are SANs that use SCSI (Small Computer System Interface) connection technology running over IP. The advantage of IP SANs (also called iSCSI-based SANs) is that they use standard Ethernet infrastructure and low-cost interconnection devices as opposed to traditional SANs that rely on the more complex and expensive Fibre Channel technology.

With IP SANs, the equipment, infrastructure, and management costs are not as high as those of a traditional SAN. What is especially appealing about the technology today is the introduction of newer products that make use of 10 Gigabit Ethernet (10 GbE) to connect the storage systems. In many cases, this connection speed overcomes bottlenecks that would have required Fibre Channel in the past.
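To give a sense of scale, the sketch below estimates the best-case time to move a 500 GB data set (the upper end of the MRI data set sizes quoted earlier) over 1 GbE and 10 GbE links. It assumes decimal units, a fully dedicated link running at line rate, and no protocol or disk overhead, so real transfers would be slower. The function name is illustrative.

```python
def transfer_seconds(size_gb, link_gbps):
    """Best-case seconds to move size_gb gigabytes over a link_gbps link."""
    size_gigabits = size_gb * 8  # 8 bits per byte, decimal units
    return size_gigabits / link_gbps


dataset_gb = 500  # upper end of the MRI data set sizes quoted in the article

# Assumption: the link runs at its nominal rate with no overhead.
for name, gbps in [("1 GbE", 1), ("10 GbE", 10)]:
    minutes = transfer_seconds(dataset_gb, gbps) / 60
    print(f"{name}: about {minutes:.1f} minutes")
```

Under these idealized assumptions the transfer drops from a little over an hour to under seven minutes.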

Last spring, Douglas Gibbs, pathology bioinformatics manager at the University of Michigan in Ann Arbor, tested one of these new products from Intransa for use with a medical imaging application. “The 10-GbE infrastructure is what allows the [system] to overcome what would otherwise be front-end ‘pipe’ barriers to achieving maximum performance,” said Gibbs.

With new products like this, the research firm IDC believes the market for iSCSI is poised for rapid growth. In fact, IDC predicts revenues for these products will have a 73 percent compound annual growth rate from 2006 to 2010.

According to the industry trade group the Storage Networking Industry Association (SNIA), “the factors driving this growth have been the continuing need for IT organizations to do more with less — less capital cost, fewer administrators per Terabyte, less complexity.”

Addressing Energy Efficiency
Costs of another type are now starting to become an issue. It takes electricity to run and cool storage systems.

Like most IT and data center equipment, storage devices continue to increase in performance while physically shrinking. While the combination of higher performance and higher densities helps meet the capacity and computational requirements for life sciences research, it also increases the demand for electricity. More power is needed to run the systems and cool the densely packed (and hotter) units.

The emphasis on storage system energy efficiency was made clear in July when EMC announced annual updates to its entire product line. Normally, these announcements focus on improved performance since all of the systems take advantage of the most recent processors and disk drives. This year, performance was certainly mentioned, but so too was energy efficiency.

EMC claimed new Symmetrix, Celerra, and CLARiiON systems reduce power consumption by 33 percent. And a new version of the EMC Centera reduces power and cooling requirements by 67 percent per terabyte.

Many other storage vendors are also targeting energy efficiency. In particular, IBM, HP, EMC, Network Appliance, SGI, Quantum, and Netezza are all members of The Green Grid, a consortium of information technology companies and professionals seeking to lower the overall consumption of power in data centers around the globe.

As data volumes grow, the increased attention to energy efficiency is going hand in hand with simplified management and high performance as key criteria for storage systems to handle life sciences data.


For reprints and/or copyright permission, please contact Tim McLucas, (781) 972-1342, tmclucas@healthtech.com.