
Adventures in XML Transformation


By Chris Dagdigian

July 20, 2005 | Jealousy can be a great professional motivator. I’ve been amazed at the clever things my colleagues have been up to recently. Using slick tools from InforSense and SciTegic, BioTeam principals have been taming clusters and bringing complex grid resources directly to desktops and exotic laboratory instruments. A recent flurry of activity in our lab has demonstrated some very interesting applications of the Automator and Widget technologies that Apple introduced with the “Tiger” 10.4 OS X release.

What do all these neat projects have in common? They are all predicated on the real-world use of XML and Web services — technologies and buzzwords that up until a year ago I had filed under the label of “massive hype; pending practical utility.”

Several months ago, sensing that the tipping point with respect to in-the-trenches usefulness had been reached, I signed up for a Web services course at a nearby university in Cambridge. The goal was to get an updated picture on the current state of the art and see what could be applied to our own work.

Consulting organizations such as The BioTeam operate right at the ragged intersection of life science research and applied information technology. Only rarely does a prepackaged or commercial solution exist for the types of problems consultants are asked to solve. Because of this, front-line consultants have great respect for practical technologies that can be applied directly to challenging software, science, or infrastructure integration issues. For a toolsmith, finding a powerful new tool is always a cause for celebration.

One of these celebratory moments came while sitting in the “Web Services and Service-Oriented Architectures” lecture devoted to issues surrounding XML parsing, searching, and transformation. It became clear that the XML-related World Wide Web Consortium (W3C) recommendations known as XSLT and XPATH were the tools necessary to address a personal project of long-standing interest: monitoring clusters and grids running Sun Microsystems’ Grid Engine 6 distributed resource management software.

Grid Engine is software that I’ve written about extensively before. The latest release contains a newly added feature that, so far, few people have gotten around to using or writing about: the ability to output detailed grid and job status information in raw XML form.

Previous efforts at building Web-based Grid Engine monitoring tools (including BioTeam’s efforts) simply involved grabbing the human-readable text output from the “qstat” program and using Perl to mark it up for display in HTML browsers. The results of such efforts are usable and useful but limited in power and flexibility because of the difficulties of handling text formatted primarily for human readability rather than automated parsing.

When Grid Engine 6 first came out, we experimented with various XML output options just to see what the data looked like. The XML output contained data and details unavailable with any other method and was an excellent source of raw grid status information.

Data Transformers

Structured data are nice and look easy to parse, but parsing is only half the battle — figuring out what to do with selected data is a different matter entirely. Knowing enough to realize the task was nontrivial, I shelved my personal experiments with XML Grid Engine data — that is, until the night the Web services course instructor introduced XSLT and XPATH as part of an early lecture. It turns out that XSLT and XPATH were the missing pieces necessary for the Grid Engine monitoring project to proceed.

XPATH is a W3C-recommended language for “addressing parts of an XML document.” It models an XML document as a tree of nodes, and standard XPATH syntax allows one to search for nodes that match certain patterns or conditions. In simple terms, XPATH allows one to do basic “search-and-select” actions on XML documents, pulling out nodes or nodelists of interest for further processing or analysis. It has been specifically designed to complement other XML technologies such as XPointer and XSLT.

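Where raw Grid Engine XML status data are concerned, XPATH is the technology that allows one to cut through the large volume of data to make targeted queries. A query such as “Give me information about all pending grid jobs” can be written as the XPATH expression //job_list[@state='pending'], which selects every job_list element, at any depth in the document, whose state attribute is “pending.” Note that the quotes inside the expression must be plain straight quotes.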
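As a rough illustration of that query, here is a minimal Python sketch using the standard library's ElementTree module, which supports a subset of XPATH. The sample document is invented for illustration: only the job_list element and its state attribute come from the query above. The other element names (JB_job_number, JB_name, JB_owner) and the function name pending_jobs are assumptions, not a guaranteed match for real Grid Engine output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely shaped like Grid Engine XML status output.
# Only <job_list state="..."> is taken from the article; the rest is assumed.
SAMPLE_XML = """\
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_name>blast_run</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_name>hmmer_scan</JB_name>
      <JB_owner>bob</JB_owner>
    </job_list>
    <job_list state="pending">
      <JB_job_number>103</JB_job_number>
      <JB_name>align_all</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </job_info>
</job_info>
"""


def pending_jobs(xml_text):
    """Return (job number, name, owner) tuples for every pending job."""
    root = ET.fromstring(xml_text)
    # ElementTree only accepts relative paths on an element, so the
    # article's //job_list[...] is written as .//job_list[...] here.
    nodes = root.findall(".//job_list[@state='pending']")
    return [
        (
            node.findtext("JB_job_number"),
            node.findtext("JB_name"),
            node.findtext("JB_owner"),
        )
        for node in nodes
    ]


if __name__ == "__main__":
    for number, name, owner in pending_jobs(SAMPLE_XML):
        print(number, name, owner)
```

The predicate in square brackets does the filtering, so the running job never reaches the Python loop. A full XPATH 1.0 engine would accept the leading // form unchanged.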

The use of XPATH addresses one problem: “How do I wade through lots of XML and pick out the bits that I’m actually interested in?” This is only a partial solution, as one still has to do something interesting (or at least visually pleasing) with the selected XML data. This is where another W3C recommendation comes into play: XSLT 1.0.

XSLT is a language for transforming one XML document into another. Its output is not limited to XML: a stylesheet can also emit other formats, including plain text and simple HTML.

The combination of XPATH and XSLT revived the Grid Engine monitoring project and enabled it to make significant progress in a few short weeks of nights-and-weekends hacking. XPATH is used to pull interesting data from the large volume of raw XML generated by Grid Engine, and XSLT is the method by which the resulting XML is transformed into text, HTML, and even syndicated RSS feeds.
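The select-then-render pattern described here can be sketched without an XSLT processor. The Python example below is a stand-in written with the standard library only. It is not XSLT and not the project's actual stylesheets, but it follows the same two steps: an XPATH-style query picks the nodes, and a rendering step turns them into an HTML table. The sample XML, the JB_* element names, and the function name jobs_to_html are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; element names other than job_list/state are assumed.
SAMPLE_XML = """\
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_name>blast_run</JB_name>
    <JB_owner>alice</JB_owner>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_name>hmmer_scan</JB_name>
    <JB_owner>bob</JB_owner>
  </job_list>
</job_info>
"""

COLUMNS = (
    ("Job", "JB_job_number"),
    ("Name", "JB_name"),
    ("Owner", "JB_owner"),
)


def jobs_to_html(xml_text, state):
    """Render all jobs in the given state as an HTML table string."""
    root = ET.fromstring(xml_text)

    table = ET.Element("table")
    header = ET.SubElement(table, "tr")
    for title, _field in COLUMNS:
        ET.SubElement(header, "th").text = title

    # Step 1: select. The state value is pasted into the path, so this
    # is only safe for simple values that contain no quote characters.
    selected = root.findall(f".//job_list[@state='{state}']")

    # Step 2: render. One table row per selected job.
    for job in selected:
        row = ET.SubElement(table, "tr")
        for _title, field in COLUMNS:
            ET.SubElement(row, "td").text = job.findtext(field, default="")

    return ET.tostring(table, encoding="unicode", method="html")


if __name__ == "__main__":
    print(jobs_to_html(SAMPLE_XML, "pending"))
```

In a real XSLT stylesheet the same two steps would be written declaratively, with a template matching the selected nodes and the HTML markup placed inside it.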

The process is surprisingly easy, and a rich and responsive Web interface for Grid Engine clusters was quickly developed. The hardest part of the entire process had nothing to do with XML — far more time was spent on Web and interface design issues involving CSS stylesheets, cross-browser display problems, and JavaScript than on the mechanics of pulling and manipulating XML data from Grid Engine.

Most impressive is the power and freedom that XSLT technology provides to quickly manipulate, create, change, and alter the output from the XML transformation operations. When initial feedback on the monitoring interface was received, consensus was that it was fine for small clusters, but an information-heavy interface would quickly become unusable on clusters with thousands of grid nodes and jobs. It took several hours to develop a new “terse” XSLT stylesheet customized for use on very large grid systems. When a second beta tester mentioned that it would “be cool to browse my jobs via the new RSS-aware Apple Safari Web browser...” it took just 30 minutes to create a prototype of an XSLT stylesheet capable of transforming Grid Engine XML into RSS-2.0-compliant RSS output.
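To give a feel for why a new output format is so cheap once the data are already in XML, here is a small Python sketch that turns job status XML into an RSS 2.0 feed. It is an assumption-laden stand-in that uses plain Python instead of XSLT, and it is not the stylesheet described above. The sample XML, the JB_* element names, the function name jobs_to_rss, and the example.org link are all hypothetical. The feed includes the title, link, and description elements that RSS 2.0 requires in a channel, but it has not been checked against a feed validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; element names other than job_list/state are assumed.
SAMPLE_XML = """\
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_name>blast_run</JB_name>
    <JB_owner>alice</JB_owner>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_name>hmmer_scan</JB_name>
    <JB_owner>bob</JB_owner>
  </job_list>
  <job_list state="pending">
    <JB_job_number>103</JB_job_number>
    <JB_name>align_all</JB_name>
    <JB_owner>alice</JB_owner>
  </job_list>
</job_info>
"""


def jobs_to_rss(xml_text, owner, link):
    """Build an RSS 2.0 document with one item per job owned by `owner`."""
    root = ET.fromstring(xml_text)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Grid jobs for {owner}"
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Job status from grid XML"

    for job in root.iter("job_list"):
        if job.findtext("JB_owner") != owner:
            continue
        name = job.findtext("JB_name", default="")
        number = job.findtext("JB_job_number", default="")
        state = job.get("state", "unknown")
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{name} ({state})"
        ET.SubElement(item, "description").text = f"Job {number} is {state}"

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # The link is a placeholder, not a real monitoring URL.
    print(jobs_to_rss(SAMPLE_XML, "alice", "http://example.org/grid"))
```

Only the rendering half changes between an HTML view and an RSS view. The selection logic and the source data stay the same, which is the property that makes adding a format quick.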

The “xml-qstat” project is released under a Creative Commons license. Visit http://xml-qstat.bioteam.net for documentation, screenshots, all the source code, and links to live demonstration sites.

Even if you don’t use Grid Engine, feel free to browse the XSLT stylesheets to see how easy it is to turn XML into customized HTML, text, and RSS documents. Happy hacking!

Chris Dagdigian is a self-described infrastructure geek currently employed by The BioTeam. E-mail: chris@bioteam.net.
