Adventures in XML Transformation


By Chris Dagdigian

July 20, 2005 | Jealousy can be a great professional motivator. I’ve been amazed at the clever things my colleagues have been up to recently. Using slick tools from InforSense and SciTegic, BioTeam principals have been taming clusters and bringing complex grid resources directly to desktops and exotic laboratory instruments. A recent flurry of activity in our lab has demonstrated some very interesting applications of the Automator and Widget technologies that Apple introduced with the “Tiger” 10.4 OS X release.

What do all these neat projects have in common? They are all predicated on the real-world use of XML and Web services — technologies and buzzwords that up until a year ago I had filed under the label of “massive hype; pending practical utility.”

Several months ago, sensing that the tipping point with respect to in-the-trenches usefulness had been reached, I signed up for a Web services course at a nearby university in Cambridge. The goal was to get an updated picture of the current state of the art and see what could be applied to our own work.

Consulting organizations such as The BioTeam operate right at the ragged intersection of life science research and applied information technology. Only rarely does a prepackaged or commercial solution exist for the types of problems consultants are asked to solve. Because of this, front-line consultants have great respect for practical technologies that can be applied directly to challenging software, science, or infrastructure integration issues. For a toolsmith, finding a powerful new tool is always a cause for celebration.

One of these celebratory moments came while sitting in the “Web Services and Service-Oriented Architectures” lecture devoted to issues surrounding XML parsing, searching, and transformation. It became clear that the XML-related World Wide Web Consortium (W3C) recommendations known as XSLT and XPATH were the tools necessary to address a personal project of long-standing interest: monitoring clusters and grids running Sun Microsystems’ Grid Engine 6 distributed resource management software.

Grid Engine is software that I’ve written about extensively before. The latest release contains a newly added feature that, so far, few people have gotten around to using or writing about: the ability to output detailed grid and job status information in raw XML form.

Previous efforts at building Web-based Grid Engine monitoring tools (including BioTeam’s efforts) simply involved grabbing the human-readable text output from the “qstat” program and using Perl to mark it up for display in HTML browsers. The results of such efforts are usable and useful but limited in power and flexibility because of the difficulties of handling text formatted primarily for human readability rather than automated parsing.

When Grid Engine 6 first came out, we experimented with various XML output options just to see what the data looked like. The XML output contained data and details unavailable with any other method and was an excellent source of raw grid status information.

Data Transformers

Structured data are nice and look easy to parse, but parsing is only half the battle — figuring out what to do with selected data is a different matter entirely. Knowing enough to realize the task was nontrivial, I shelved my personal experiments with XML Grid Engine data — that is, until the night the Web services course instructor introduced XSLT and XPATH as part of an early lecture. It turns out that XSLT and XPATH were the missing pieces necessary for the Grid Engine monitoring project to proceed.

XPATH is a W3C-recommended language for “addressing parts of an XML document.” It models an XML document as a tree of nodes, and standard XPATH syntax allows one to search for nodes that match certain patterns or conditions. In simple terms, XPATH allows one to do basic “search-and-select” actions on XML documents, pulling out nodes or nodelists of interest for further processing or analysis. It has been specifically designed to complement other XML technologies such as XPointer and XSLT.

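Where raw Grid Engine XML status data are concerned, XPATH is the technology that allows one to cut through the large volume of data and make targeted queries. A query such as “Give me information about all pending grid jobs” would be represented as the XPATH search string //job_list[@state='pending'], which selects every job_list element, anywhere in the document, whose state attribute has the value 'pending'. Note that the quotes inside the expression must be plain ASCII quotes, not typographic ones.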
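As a rough illustration of that query, the Python sketch below runs the same attribute-based selection with the standard library's xml.etree.ElementTree module, which supports only a limited subset of XPATH. The sample document is a hand-written, simplified stand-in rather than real Grid Engine output: only the job_list element and its state attribute come from the query above, and the child element names (JB_job_number, JB_owner, JB_name) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for Grid Engine XML status output.
# Only job_list and its state attribute are taken from the article;
# the child element names are assumed for illustration.
SAMPLE_XML = """
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_owner>alice</JB_owner>
      <JB_name>blast_run</JB_name>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_owner>bob</JB_owner>
      <JB_name>hmmer_scan</JB_name>
    </job_list>
    <job_list state="pending">
      <JB_job_number>103</JB_job_number>
      <JB_owner>alice</JB_owner>
      <JB_name>assembly</JB_name>
    </job_list>
  </job_info>
</job_info>
"""


def pending_jobs(xml_text):
    """Return every job_list element whose state attribute is 'pending'."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree wants a leading "." so the descendant search starts
    # from the root element; otherwise this mirrors the query in the text.
    return root.findall(".//job_list[@state='pending']")


if __name__ == "__main__":
    for job in pending_jobs(SAMPLE_XML):
        print(job.findtext("JB_job_number"), job.findtext("JB_owner"))
```

The selected nodes are ordinary element objects, so anything further (sorting, counting, formatting) is plain Python from here on.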

The use of XPATH addresses one problem: “How do I wade through lots of XML and pick out the bits that I’m actually interested in?” This is only a partial solution, as one still has to do something interesting (or at least visually pleasing) with the selected XML data. This is where another W3C recommendation comes into play: XSLT 1.0.

XSLT is a language for transforming one XML document into another XML document. Its output is not limited to XML, however: a stylesheet can also produce other formats, including plain text and simple HTML.

The combination of XPATH and XSLT revived the Grid Engine monitoring project and enabled it to make significant progress in a few short weeks of nights-and-weekends hacking. XPATH is used to pull interesting data from the large volume of raw XML generated by Grid Engine, and XSLT is the method by which the resulting XML is transformed into text, HTML, and even syndicated RSS feeds.
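The select-then-transform idea can be sketched in a few lines. The example below is not XSLT and is not taken from the xml-qstat code: Python's standard library has no XSLT processor, so a plain Python function stands in for the stylesheet and turns the selected nodes into an HTML table. The sample XML, the child element names, and the function name render_html are all assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; child element names are assumed.
SAMPLE_XML = """
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_owner>alice</JB_owner>
    <JB_name>blast_run</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_owner>bob</JB_owner>
    <JB_name>hmmer_scan</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>103</JB_job_number>
    <JB_owner>alice</JB_owner>
    <JB_name>align&amp;sort</JB_name>
  </job_list>
</job_info>
"""

# Columns to show, in display order (assumed element names).
COLUMNS = ["JB_job_number", "JB_owner", "JB_name"]


def render_html(xml_text, path=".//job_list[@state='pending']"):
    """Select job nodes with a path expression and render an HTML table."""
    root = ET.fromstring(xml_text.strip())
    header = "<tr>" + "".join(
        "<th>{}</th>".format(html.escape(name)) for name in COLUMNS
    ) + "</tr>"
    rows = []
    for job in root.findall(path):  # step 1: select the nodes of interest
        cells = [job.findtext(name, default="") for name in COLUMNS]
        # step 2: transform each node into markup, escaping the text
        rows.append("<tr>" + "".join(
            "<td>{}</td>".format(html.escape(cell)) for cell in cells
        ) + "</tr>")
    return "<table>\n" + "\n".join([header] + rows) + "\n</table>"


if __name__ == "__main__":
    print(render_html(SAMPLE_XML))
```

Passing a different path, such as ".//job_list[@state='running']", changes what is selected without touching the rendering code. That separation between selection and presentation is what XPATH and XSLT provide in declarative form.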

The process is surprisingly easy, and a rich and responsive Web interface for Grid Engine clusters was quickly developed. The hardest part of the entire process had nothing to do with XML — far more time was spent on Web and interface design issues involving CSS stylesheets, cross-browser display problems, and JavaScript than on the mechanics of pulling and manipulating XML data from Grid Engine.

Most impressive is the freedom XSLT provides to quickly change the output of the XML transformation. When initial feedback on the monitoring interface was received, the consensus was that it was fine for small clusters, but that an information-heavy interface would quickly become unusable on clusters with thousands of grid nodes and jobs. It took several hours to develop a new “terse” XSLT stylesheet customized for use on very large grid systems. When a second beta tester mentioned that it would “be cool to browse my jobs via the new RSS-aware Apple Safari Web browser...” it took just 30 minutes to create a prototype XSLT stylesheet capable of transforming Grid Engine XML into RSS 2.0-compliant output.
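To give a feel for how small such a transformation is, here is a minimal Python sketch that turns job status XML into an RSS 2.0 feed. It is a stand-in written with the standard library, not the XSLT stylesheet from the project. The sample XML, the child element names, the feed title, and the example.org link are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; child element names are assumed.
SAMPLE_XML = """
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_owner>alice</JB_owner>
    <JB_name>blast_run</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_owner>bob</JB_owner>
    <JB_name>hmmer_scan</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>103</JB_job_number>
    <JB_owner>alice</JB_owner>
    <JB_name>assembly</JB_name>
  </job_list>
</job_info>
"""


def jobs_to_rss(xml_text, feed_link="http://example.org/grid-status"):
    """Build an RSS 2.0 document with one item per job_list element."""
    root = ET.fromstring(xml_text.strip())
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = "Grid job status"
    ET.SubElement(channel, "link").text = feed_link  # placeholder URL
    ET.SubElement(channel, "description").text = "Jobs known to the scheduler"
    for job in root.iter("job_list"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "Job {} ({})".format(
            job.findtext("JB_job_number", default="?"),
            job.get("state", "unknown"),
        )
        ET.SubElement(item, "description").text = "{} owned by {}".format(
            job.findtext("JB_name", default="?"),
            job.findtext("JB_owner", default="?"),
        )
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(jobs_to_rss(SAMPLE_XML))
```

The input handling is the same as for any other output format; only the elements being emitted differ, which is why adding a new format is quick.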

The “xml-qstat” project is released under a Creative Commons license. Visit http://xml-qstat.bioteam.net for documentation, screenshots, all the source code, and links to live demonstration sites.

Even if you don’t use Grid Engine, feel free to browse the XSLT stylesheets to see how easy it is to turn XML into customized HTML, text, and RSS documents. Happy hacking!

Chris Dagdigian is a self-described infrastructure geek currently employed by The BioTeam. E-mail: chris@bioteam.net.
