November 10, 2009 | Signature Supplement
Greek mythology grabs your attention as a hero follows a complicated path to his final destination. Today, IT departments in life sciences organizations have their own attention grabber: Understanding the path and the best way to move data from its creation through its lifecycle.
While the story about the hero usually has a moral and philosophical implication, the path data takes and how it moves have significant consequences for the performance and economics of storage solutions.
The questions that need to be answered are: What’s the most suitable and economical place to store data at various times in its lifetime? And how can data be moved in a manner that eliminates any user disruptions and minimizes the time to conduct
For years, organizations have explored hierarchical storage management and information lifecycle management. And in fact, many organizations have deployed systems to move data from storage tier to storage tier as it got older or was not used as often.
But such approaches do not match today’s data management requirements.
Obviously, the volume of data that must be managed today is much larger than in the past. But other factors also come into play, making data management more complicated than it has been before.
In the past, once data was created, say a sequencer experiment was run, that data was initially stored, analyzed soon after, and then archived to tape or another removable medium or even destroyed. This process lent itself to a simple data migration strategy.
So what complicates matters now? In life sciences organizations today, data of all sorts is being kept longer. Blogs, wikis, and other Web 2.0 applications tend to last forever. Once results of an experiment are posted, they remain online for good. And raw experimental data is often kept longer as life sciences organizations frequently re-analyze raw data seeking new indications for previously discovered entities.
Additionally, regulatory and new patient safety requirements continue to dictate that more data must be retained for longer periods. This impacts data management in that you must decide which data gets stored on which systems at a particular point in that data’s lifetime.
Commercial life sciences organizations — being businesses — are now subject to eDiscovery laws. Such laws require companies in litigation to quickly produce email, files, and other digital assets once information is subpoenaed.
To put the significance of eDiscovery into perspective in today’s litigious world, companies with more than $500 million in revenue may routinely face five or more litigation matters each year, according to the IT trade publication eWeek. Again, this has implications for how data is managed and how it is stored to facilitate fast search and retrieval.
All of these factors make it harder to apply simple classification or categorization criteria to data. For example, simply tagging data older than six months as ripe for archiving doesn’t cut it today. This, in turn, makes traditional approaches to data migration less effective and the creation of data retention policies more complicated.
Adding another level of complexity to the problem is the fact that a great deal of life sciences research is carried out using high performance computational analysis workflows. Such workflows require the right data to be on the right storage device at the right time.
In such workflows older experimental data might get priority over newer data if research associated with that data rises in importance to the organization. So simply moving older data to slower performance storage based solely on its age will not work in today’s dynamic research and development environments.
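To make the point concrete, here is a minimal sketch that compares an age-only rule with one that also weighs research priority. The `Dataset` fields, the tier names, the 180-day cutoff, and the priority score are all illustrative assumptions, not part of any product.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    age_days: int
    project_priority: int  # hypothetical score: higher means more important right now


def tier_by_age(ds: Dataset, cutoff_days: int = 180) -> str:
    """Age-only rule: anything older than the cutoff goes to the slow tier."""
    return "archive" if ds.age_days > cutoff_days else "fast"


def tier_by_age_and_priority(ds: Dataset, cutoff_days: int = 180,
                             priority_floor: int = 5) -> str:
    """Age rule with an override: data tied to high-priority research stays fast."""
    if ds.project_priority >= priority_floor:
        return "fast"
    return tier_by_age(ds, cutoff_days)


# An old experiment whose project has just risen in importance (made-up values).
old_but_hot = Dataset("2007_sequencer_run", age_days=700, project_priority=9)
```

Under the age-only rule this dataset would be sent to the archive tier; with the priority override it stays on fast storage, which is the behavior the workflow actually needs.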
Optimizing Data Movement and Placement
Most life sciences research and development organizations rely on industrialized computational pipelines to quickly process, analyze, and visualize data so that decisions about which candidates to pursue or which experiments to do next can be made as early as possible.
Many organizations have made huge investments in their IT infrastructure to keep these pipelines running smoothly and efficiently. For instance, it is quite common to find Linux clusters of servers based on the latest multi-core processors and 10-Gigabit Ethernet interconnection devices as the backbone of a high performance computing facility.
Naturally, to speed the computational workflows and to keep these systems running at peak performance, these systems must be served by high performance storage systems. Specifically, the storage systems must be able to maintain the high data throughputs required to keep the computational systems’ appetite for raw data constantly fed.
The trick in this case is to figure out which data gets placed on the high performance storage systems, how long it stays there, and then where to put it next. For instance, it makes no sense to host data that is not part of a high throughput computational pipeline on the highest performance storage devices. Such data could be placed on lower performance and more economical storage.
Similarly, what do you do with the lab data once it has been processed, analyzed, or visualized? After all, it doesn’t have to stay on the premium storage drives.
And what should be archived? In today’s dynamic research environment, a slight change in market conditions might cast new light on a previously dismissed drug candidate or bring attention to a new indication for an existing drug. As such, data that was slated for the archives (or disposal) would need to be readily available.
It seems nothing short of manually assigning and moving each byte of data to the storage drive that is the most appropriate for that data at that moment will do. But such an approach is incredibly unrealistic.
Since different types of data can have different lifecycles or different intrinsic values, it is desirable to design application-specific or user-specific tiers of storage for each stage of the data lifecycle.
As multiple storage technologies are often the most appropriate match to each point in the lifecycle, the concept of tiered storage is central to an effective data management strategy. Certain tiers may be architected with different performance characteristics in mind, or for better cost-effectiveness, or just so that the data they contain is bound to certain processes, applications, users, or groups. But it is important to remember the concept of tiering is not limited to the different types of storage hardware available. One might define different tiers for different instruments, different projects, different research groups, or even different days of the week. What is important is that the data is more efficiently classified according to how it is used within the research organization.
It is difficult enough to mine data for relevance on a laptop with one or two million files. For petascale filesystems with billions and billions of files, a single giant filesystem is useless for data classification. One would spend more time searching than actually doing anything with the results.
Any storage tiering approach by itself offers only limited advantages. What really provides value is the ability to transparently move data from tier to tier, keeping a single filesystem presentation to the hosts, users, and applications. This eliminates the need for changes such as redirecting an application to a new drive or volume when a file is moved.
High performance network storage systems provider BlueArc calls this type of data movement with its single filesystem view, Transparent Data Mobility (TDM).
Such mobility includes several components. To start, there is Intelligent Tiered Storage, which provides IT organizations with a single, integrated system to support all stages of the information management lifecycle.
With Intelligent Tiered Storage, online, nearline, and archival data can reside on any combination of solid state, Fibre Channel, and SATA disks. Also, BlueArc automatically and intelligently migrates data across various tiers using policies established by the storage administrator. These policies can be based on a wide range of parameters, such as when the data was last accessed, the owner of the data, and the amount of disk space available.
Intelligent Tiered Storage allows IT organizations to optimize storage efficiency by matching the storage media to the specific requirements of each supported workload. For example, high-performance Fibre Channel disk stores frequently accessed, real-time, and high-priority data while more cost-effective SATA disk stores less frequently accessed files. BlueArc automatically and intelligently migrates the less frequently accessed data to lower cost storage, whether SATA or an external archival system.
To move the data and enable TDM, BlueArc uses Data Migrator, a policy-based engine allowing administrators to implement data movement policies. Data Migrator works by allowing administrators to define policies, or even hierarchies of policies, which classify data and move that data from tier-to-tier based on the defined criteria. Metadata attributes such as file type, file size, user or group ownership of file, last time of access, and dozens of other variables can be used to craft extremely effective data movement policies. Data movement may also be scheduled. Different policies may be defined based on available free space, thus allowing for more aggressive migration policies when space is low.
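The idea of metadata-driven policies, with a more aggressive set that takes over when space runs low, can be sketched roughly as follows. This is not BlueArc's policy syntax or API; `FileMeta`, `Rule`, `select_rules`, `classify`, the thresholds, and the tier names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FileMeta:
    path: str
    size_bytes: int
    owner: str
    days_since_access: int


@dataclass
class Rule:
    name: str
    matches: Callable[[FileMeta], bool]  # predicate over file metadata
    target_tier: str


def select_rules(free_fraction: float, normal: List[Rule], aggressive: List[Rule],
                 low_space_threshold: float = 0.10) -> List[Rule]:
    """Pick the aggressive rule set when free space drops below the threshold."""
    return aggressive if free_fraction < low_space_threshold else normal


def classify(f: FileMeta, rules: List[Rule]) -> Optional[str]:
    """Return the target tier of the first matching rule, or None to leave the file."""
    for rule in rules:
        if rule.matches(f):
            return rule.target_tier
    return None


# Assumed example policies: only large, stale files move under normal conditions.
normal_rules = [
    Rule("stale-large",
         lambda f: f.days_since_access > 90 and f.size_bytes > 1_000_000_000,
         "sata"),
]
# When space is low, anything untouched for 30 days becomes a candidate.
aggressive_rules = [
    Rule("stale-any", lambda f: f.days_since_access > 30, "sata"),
]

sample = FileMeta("/proj/run42/reads.fastq", 5_000_000, "alice", 45)
```

With half the primary tier free, `classify(sample, select_rules(0.5, normal_rules, aggressive_rules))` leaves the sample file where it is; with only 5 percent free, the same call selects the aggressive set and marks the file for the SATA tier.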
With Data Migrator, storage managers also have a “what if” option that lets them craft a policy and analyze its impact on the various storage tiers, but without actually implementing the policy and initiating data movement. Such a tool is useful to compare various hypothetical data management scenarios to select the best one for the application environment at hand.
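A “what if” analysis of this kind amounts to a dry run: evaluate the policy against a file inventory and report the effect on each tier without moving anything. The sketch below assumes a simple tuple-based inventory and made-up tier names; it illustrates the concept and is not the Data Migrator tool itself.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical inventory row: (path, size_bytes, current_tier, days_since_access)
FileRow = Tuple[str, int, str, int]


def what_if(files: Iterable[FileRow],
            policy: Callable[[FileRow], Optional[str]]) -> Dict[str, int]:
    """Return the net change in bytes per tier if the policy ran. Moves nothing."""
    delta: Dict[str, int] = defaultdict(int)
    for row in files:
        _, size, current, _ = row
        target = policy(row)
        if target is not None and target != current:
            delta[current] -= size
            delta[target] += size
    return dict(delta)


def archive_after_60_days(row: FileRow) -> Optional[str]:
    """Assumed example policy: files idle for more than 60 days belong on SATA."""
    return "sata" if row[3] > 60 else None


inventory = [
    ("/a.bam", 400, "fc", 10),
    ("/b.bam", 300, "fc", 90),
    ("/c.bam", 200, "sata", 120),
]
report = what_if(inventory, archive_after_60_days)
```

Running several candidate policies through the same function and comparing the resulting reports is one way to choose between hypothetical scenarios before committing to any data movement.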
How is this different from traditional information lifecycle management (ILM) solutions? In a word: Transparency. Data Migrator is transparent to end-users and applications. Specifically, it addresses one of the biggest challenges with out-of-band ILM solutions. When data is moved in such systems, users must be notified of the new data location and applications have to be “reconnected” to the relocated files.
Because Data Migrator is an embedded feature of the BlueArc filesystem, SiliconFS™, all filesystem functions work seamlessly as if the data were still on the original storage tier. Users and applications see the data as if it still existed in the original locations, while SiliconFS keeps track of where the data actually resides. For this reason BlueArc’s Data Migrator is often described by storage analysts as a “transparent, policy-based data migration engine” for implementing ILM policies.
Many organizations use a broad mix of storage solutions, either to implement multi-vendor strategies or just to select the best technologies available. BlueArc extends the TDM benefits of Data Migrator with another technology called Cross-Volume Links (XVL). Such links are zero-length files that reside on the primary filesystem and point at the corresponding file on the secondary system, which houses the migrated data file. All of the metadata required for directory level operations (including owner, access mode, and access-control lists, or ACLs) is maintained on the primary filesystem.
The utility of the Cross-Volume Links to the TDM strategy becomes obvious once data is migrated to external storage devices. Cross-Volume Links are designed to operate either with internal BlueArc storage tiers or external, third-party storage devices. It is the incorporation of external storage devices which greatly extends the reach of Data Migrator, and thus the entire BlueArc TDM strategy. TDM with XVL can be used to repurpose existing storage investments, or to incorporate non-BlueArc storage as transparent tiers within the same namespace.
Data can be migrated from tier-to-tier-to-tier, even to external tiers, and still be managed and presented to hosts and applications as a single cohesive whole. The use of external devices as remote target filesystems is currently limited to those devices which can be accessed via NFS or HTTP protocols. Future versions of the XVL technology will make use of additional protocols, greatly expanding the list of third-party devices which could be incorporated into a BlueArc TDM strategy.
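The link mechanism can be pictured with a small in-memory model: after migration, the primary keeps the metadata and a pointer, while the bytes live on the secondary. Everything in this sketch (the `Entry` and `Primary` classes, the `tier2:` key scheme, the dictionary standing in for the secondary system) is a simplifying assumption and says nothing about how SiliconFS is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    owner: str
    mode: int
    data: Optional[bytes] = None       # present while the file lives on the primary
    link_target: Optional[str] = None  # set once the data has been migrated


class Primary:
    """Toy primary filesystem that leaves a link behind when data is migrated."""

    def __init__(self, secondary: Dict[str, bytes]):
        self.entries: Dict[str, Entry] = {}
        self.secondary = secondary  # stands in for the secondary storage system

    def write(self, path: str, data: bytes, owner: str, mode: int = 0o640) -> None:
        self.entries[path] = Entry(owner, mode, data=data)

    def migrate(self, path: str) -> None:
        entry = self.entries[path]
        if entry.data is None:
            return  # already migrated
        key = "tier2:" + path  # hypothetical naming scheme on the secondary
        self.secondary[key] = entry.data
        entry.data, entry.link_target = None, key

    def stat(self, path: str) -> Tuple[str, int]:
        """Metadata is answered from the primary, migrated or not."""
        entry = self.entries[path]
        return (entry.owner, entry.mode)

    def read(self, path: str) -> bytes:
        """Follow the link transparently if the data has been migrated."""
        entry = self.entries[path]
        if entry.link_target is not None:
            return self.secondary[entry.link_target]
        return entry.data
```

A caller that writes a file, migrates it, and then calls `read` and `stat` with the original path gets the same answers as before the migration, which is the transparency property the article describes.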
BlueArc offers other technologies that further complement Data Migrator.
The first is Dynamic Caching, a feature that reserves space on a storage tier for caching of “hot” files. If the cache is created in a high-performance tier of storage, this guarantees that any hot files are automatically on the highest performance disk tier. Having the cache obviates the need for reverse data migration – why move the data back to the originating tier if a copy of it already exists on the highest performance tier?
Dynamic Caching can be applied to a cluster of BlueArc servers (i.e., many servers under a single namespace), in which case it is called Cluster Read Caching, or to a single BlueArc server, in which case it is called Local Read Caching. When used with a cluster of BlueArc servers, each server maintains its own Dynamic Cache, but is aware of the files accessed by all the other servers in the cluster. Copies of hot files from anywhere in the cluster therefore make their way to every cache on every BlueArc server, which can result in dramatic aggregate read performance improvements since every server can respond to any read request for a given set of hot files.
Policy-driven and fully automated, Dynamic Caching transparently monitors file access patterns and caches only those files necessary to satisfy individual host and application requests received by SiliconFS. Applications with read-intensive workload profiles and a need to stage data in an optimized workflow process can leverage read caching as a way to scale performance when and how they need it.
Dynamic Caching can help significantly improve the performance of many life sciences high throughput computational analysis workflows. In particular, wherever storage systems are hitting hard limitations with performance or scalable and sustainable client/server access, dynamic read caching can help achieve new levels of optimization and speed time to results.
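The cluster-wide behavior described above, where each server keeps its own cache but hot files end up in all of them, can be approximated in a few lines. The access-count threshold, the shared counter, and the class names are assumptions made for this sketch, and cache eviction is deliberately left out.

```python
from collections import Counter
from typing import Dict, List


class Server:
    def __init__(self, name: str):
        self.name = name
        self.cache: Dict[str, bytes] = {}  # this server's own read cache


class Cluster:
    def __init__(self, servers: List[Server], backing: Dict[str, bytes],
                 hot_after: int = 3):
        self.servers = servers
        self.backing = backing      # stands in for the slower storage tier
        self.hot_after = hot_after  # hypothetical "hot" threshold
        self.hits: Counter = Counter()  # access counts visible to the whole cluster

    def read(self, server: Server, path: str) -> bytes:
        self.hits[path] += 1
        if path in server.cache:
            return server.cache[path]  # served from the local cache
        data = self.backing[path]
        if self.hits[path] >= self.hot_after:
            # The file is now considered hot: copy it into every server's cache.
            for s in self.servers:
                s.cache[path] = data
        return data
```

In this model, once a reference file has been read three times through any one server, every other server in the cluster can answer reads for it from its own cache, so read load for hot files spreads across all servers.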
To address another aspect of TDM, BlueArc offers three data relocation mechanisms. These include:
- Enterprise Virtual Server (EVS) Migration, which makes it possible to relocate a virtual server within a cluster or to a server outside of the cluster that shares access to the same storage devices. EVS migration is typically used for adjusting workflows or vacating a server for scheduled maintenance.
- Filesystem relocation, which allows any filesystem accessed via Cluster Namespace to be relocated to another server within the cluster. Filesystem relocation is typically used to load balance within the unified namespace.
- Data relocation, which allows data to be relocated from any given filesystem to another using a mechanism referred to as Transfer of Primary Access (TPA). TPA is generally used to better organize filesystems or directories within them.
Used together, all of these technologies give life sciences organizations a way to automate data migration while ensuring minimal disruption for users and optimized performance for applications.
Technology for Transparent Data Mobility
To complement its hardware, BlueArc offers a number of technologies that help reduce storage management costs, optimize the use of storage resources, and automatically move data to an appropriate storage tier.
These technologies include:
• Data Migrator: A policy-based engine allowing administrators to implement data movement policies
• Cross-Volume Links: Extends the reach of Data Migrator policies to third-party devices, all incorporated transparently into the same namespace
• Cluster Namespace: Unifying technology to make all storage tiers appear as a single transparent whole
• Dynamic Caching: A policy-driven and automated feature that reserves space on a storage tier for caching of “hot” files
• Cluster Read Caching: Caching applied to a cluster of BlueArc servers
• Enterprise Virtual Server Migration: A feature that makes it possible to relocate a virtual server within a cluster or to a server outside of the cluster that shares access to the same storage devices
• Filesystem Relocation: A feature that allows any filesystem accessed via Cluster Namespace to be relocated to another server within the cluster
• Data Relocation: A feature that allows data to be relocated from any given filesystem to another
BlueArc Data Migrator
A policy-based solution that lets administrators easily and automatically migrate data between tiers of storage or to file systems on external connected devices
• Automates data migration via a rules-based policy engine
• Eliminates management complexity
• Helps companies realize the benefits of Transparent Data Mobility
• Includes policy templates and scheduler
• “What if” tool lets managers analyze the impact of a policy without actually implementing the policy or initiating data movement
BlueArc’s Transparent Data Mobility solution makes use of Cross-Volume Links, which reside on the primary filesystem and point at the corresponding file on a secondary system that houses the migrated data file.
• Extends Data Migrator automation to external devices to support archiving, deduplication, compression, or repurposing of third-party devices
• Allows repurposing of existing capacity to drive additional value from aging
• Enables all management and quota tracking as if applications were using the primary storage tier
Dynamic Read Caching
A policy-driven and automated feature that reserves space on a storage tier for caching of “hot” files
• Eliminates bottlenecks and improves performance of NFS read operations
• Aggregates throughput across a cluster for larger numbers of hosts
• Reduces manual data administration tasks with automated cache management
• Reduces over-provisioning of storage
• Saves costs by reducing the amount of high performance storage needed