By Michael Athanas
June 14, 2005 | You’ve got to love those old-geezer computer stories that start out “I remember when...” and what follows is some account of dropping a stack of punch cards or how glorious and macho it was to program in assembly language. Well, here’s my geezer story. I remember when my graduate advisor was the envy of other research groups because he had a 250-megabyte disk drive attached to our shared VAX workstation. We no longer had to repeatedly mount a data tape for our experimental physics data analysis. Rather, we could luxuriously store our experimental data on this spacious piece of technology. The thing made more noise than jets taking off at Logan airport and was bigger than a few large breadboxes. It was my first exposure to direct attached storage. Several weeks after the fancy disk was made available to us graduate students, it was filled to capacity, and we were swapping to tape. Are we learning anything?
Megabytes grew to gigabytes, and so on. Storage is now amazingly cheap. At prices approaching $2,000 per terabyte, deploying a small storage array for a small lab workgroup is becoming affordable and accessible to non-IT specialists. Most strikingly, storage arrays are creeping below the limit of the discretionary budgets of laboratory scientists.
There are many flavors of storage falling into this price range, including direct attached and storage server arrays (see “IT’s Alive! Notes from the Lab,” October 2004 Bio•IT World, page 46). Acquiring the hardware can be as easy as providing your credit card over the Internet. Making the storage array available to your workgroup is only slightly more difficult. Managing the device after several months of usage can be daunting.
Life science laboratories use workgroup storage arrays as data repositories for laboratory instruments such as microarray and mass spectrometry systems. Data production rates in small laboratories can exceed several gigabytes per hour, which adds up over the course of a year. Data repositories are created to hold raw data in addition to processed and intermediate analysis results. Installing a laboratory storage facility will hopefully centralize experimental data that may otherwise lie scattered across various desktop, laptop, and device control computers.
There are many commercial solutions for maintaining a laboratory data repository. For example, instrument providers sometimes offer general storage management software to manage the data derived from their instruments. More often than not, however, these general solutions don't quite fit specific laboratory situations. The challenge is maintaining the storage device and ensuring that it remains a useful repository in the months following installation. In the hands of scientists, free disk space vanishes more quickly than the Social Security benefit for baby boomers. Furthermore, individual files can become orphaned in a large multi-terabyte directory hierarchy.
How do you issue a storage usage policy that accommodates a diverse set of users, access use cases, and laboratory devices?
First, provide users with a generously large and open workgroup access/scratch partition — a sandbox where users can stage data for processing, dump archive media, or simply exchange data within the workgroup. The partition should be managed automatically by a nightly task that removes files and directories older than the long-term attention span of a scientist — say, a month.
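Such a nightly cleanup task need not be elaborate. Here is a minimal sketch in Python, assuming the scratch partition is mounted at a hypothetical path such as /scratch; the function name and the 30-day threshold are illustrative, not prescriptive.

```python
import os
import shutil
import time

MAX_AGE_DAYS = 30  # roughly the long-term attention span of a scientist

def purge_scratch(root, max_age_days=MAX_AGE_DAYS):
    """Remove top-level files and directories under `root` whose
    modification time is older than `max_age_days` days.
    Returns the list of paths removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed.append(path)
    return removed

if __name__ == "__main__":
    # Scheduled nightly, e.g. from cron: 0 3 * * * python purge_scratch.py
    purge_scratch("/scratch")  # hypothetical mount point
```

Note that a directory's modification time only changes when entries are added or removed directly inside it, so a production version might instead examine the newest file anywhere under each directory before deleting it.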
In regard to storing large experimental data files, one’s first inclination might be to construct an elaborate directory hierarchy that takes into account projects, lab machines, LIMS identifiers, dates, users, classes of experiment, and so on. Envisioning such a directory hierarchy may be as challenging as getting your lab mates to use it. Furthermore, a directory hierarchy is very difficult to adapt when faced with the evolution of requirements and access patterns.
Rather than exposing the data repository as a raw directory hierarchy, expose an abstraction over the physical storage. The easiest way to do this is to provide access through a Web page running on the direct attached storage's host server. For example, a Web form could capture all the parameters that describe a particular experimental data set when instrument data is submitted to the repository. Upon submission, the parameters would be recorded in a database. If the data set were very large, it could be stored in an unexposed directory with a path reference recorded in the database.
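The submission side of such an abstraction might look like the following sketch, written here in Python with SQLite standing in for the database; the table layout, field names, and storage path are all hypothetical.

```python
import os
import sqlite3
import uuid

DATA_ROOT = "/repository/data"  # unexposed directory; hypothetical path

def init_db(conn):
    """Create the table that records the descriptive parameters."""
    conn.execute("""CREATE TABLE IF NOT EXISTS datasets (
        id           TEXT PRIMARY KEY,
        instrument   TEXT,
        project      TEXT,
        submitted_by TEXT,
        submitted_at TEXT DEFAULT CURRENT_TIMESTAMP,
        path         TEXT)""")

def submit_dataset(conn, instrument, project, user, payload,
                   data_root=DATA_ROOT):
    """Record the dataset's parameters in the database and stash the raw
    bytes in the unexposed directory, keeping only a path reference."""
    dataset_id = str(uuid.uuid4())
    path = os.path.join(data_root, dataset_id + ".raw")
    with open(path, "wb") as f:
        f.write(payload)
    conn.execute(
        "INSERT INTO datasets (id, instrument, project, submitted_by, path) "
        "VALUES (?, ?, ?, ?, ?)",
        (dataset_id, instrument, project, user, path))
    conn.commit()
    return dataset_id
```

Because users only ever see the recorded parameters, the physical layout under the unexposed directory can be reorganized later without breaking anyone's habits.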
With a Web-based abstraction to the repository, users can navigate the data in whatever way makes sense, depending on how the parameters are exposed. This means users can sort data listings according to parameters recorded in the database. A Web-based portal can also maintain an audit trail, so that historical access may be reviewed.
The software engineering effort involved in building a custom Web-based abstraction interface will quickly pay for itself. There are many development tools available that are designed to deploy Web portals rapidly and reliably. For example, the powerful combination of PHP and MySQL is well suited to making such Web portal implementations quick and painless.
Abstracting access to the experimental data creates a buffer between changing usage requirements and the physical implementation. A terabyte is an intimidating quantity of data to move around, and may later be good material for an old-geezer story 20 years from now.
Michael Athanas, Ph.D., is a Life Sciences Informatics Consultant and a Founding Partner of The BioTeam, a scalable informatics solutions provider. E-mail: firstname.lastname@example.org.