Adventures in XML Transformation



By Chris Dagdigian

July 20, 2005 | Jealousy can be a great professional motivator. I’ve been amazed at the clever things my colleagues have been up to recently. Using slick tools from InforSense and Scitegic, BioTeam principals have been taming clusters and bringing complex grid resources directly to desktops and exotic laboratory instruments. A recent flurry of activity in our lab has demonstrated some very interesting applications of the Automator and Widget technologies that Apple introduced with the “Tiger” 10.4 OS X release.

What do all these neat projects have in common? They are all predicated on the real-world use of XML and Web services — technologies and buzzwords that up until a year ago I had filed under the label of “massive hype; pending practical utility.”

Several months ago, sensing that the tipping point with respect to in-the-trenches usefulness had been reached, I signed up for a Web services course at a nearby university in Cambridge. The goal was to get an updated picture of the current state of the art and see what could be applied to our own work.

Consulting organizations such as The BioTeam operate right at the ragged intersection of life science research and applied information technology. Only rarely does a prepackaged or commercial solution exist for the types of problems consultants are asked to solve. Because of this, front-line consultants have great respect for practical technologies that can be applied directly to challenging software, science, or infrastructure integration issues. For a toolsmith, finding a powerful new tool is always a cause for celebration.

One of these celebratory moments came while sitting in the “Web Services and Service-Oriented Architectures” lecture devoted to issues surrounding XML parsing, searching, and transformation. It became clear that the XML-related World Wide Web Consortium (W3C) recommendations known as XSLT and XPATH were the tools necessary to address a personal project of long-standing interest: monitoring clusters and grids running Sun Microsystems’ Grid Engine 6 distributed resource management software.

Grid Engine is software that I’ve written about extensively before. The latest release contains a newly added feature that, so far, few people have gotten around to using or writing about: the ability to output detailed grid and job status information in raw XML form.

Previous efforts at building Web-based Grid Engine monitoring tools (including BioTeam’s efforts) simply involved grabbing the human-readable text output from the “qstat” program and using Perl to mark it up for display in HTML browsers. The results of such efforts are usable and useful but limited in power and flexibility because of the difficulties of handling text formatted primarily for human readability rather than automated parsing.

When Grid Engine 6 first came out, we experimented with various XML output options just to see what the data looked like. The XML output contained data and details unavailable with any other method and was an excellent source of raw grid status information.

Data Transformers

Structured data are nice and look easy to parse, but parsing is only half the battle — figuring out what to do with selected data is a different matter entirely. Knowing enough to realize the task was nontrivial, I shelved my personal experiments with XML Grid Engine data — that is, until the night the Web services course instructor introduced XSLT and XPATH as part of an early lecture. It turns out that XSLT and XPATH were the missing pieces necessary for the Grid Engine monitoring project to proceed.

XPATH is a W3C-recommended language for “addressing parts of an XML document.” It models an XML document as a tree of nodes, and standard XPATH syntax allows one to search for nodes that match certain patterns or conditions. In simple terms, XPATH allows one to do basic “search-and-select” actions on XML documents, pulling out nodes or nodelists of interest for further processing or analysis. It has been specifically designed to complement other XML technologies such as XPointer and XSLT.

Where raw Grid Engine XML status data are concerned, XPATH is the technology that allows one to cut through the large volume of data to make targeted queries. A request such as “Give me information about all pending grid jobs” would be written as the XPATH expression //job_list[@state='pending'] (note the plain straight quotes, which XPATH requires around string values).
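As a rough illustration of this kind of targeted query, the sketch below runs the same search against a small, made-up status document using Python's standard library. Only the job_list element and its state attribute come from the query above; the other element names (JB_job_number, JB_name, JB_owner), the sample values, and the pending_jobs helper are assumptions for illustration, and real Grid Engine output carries far more detail. Python's ElementTree supports only a subset of XPATH and expects a relative path, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only <job_list> and its "state" attribute are
# taken from the article's query; every other name here is an assumption.
SAMPLE_XML = """\
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_name>blast_run</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_name>hmmer_scan</JB_name>
      <JB_owner>bob</JB_owner>
    </job_list>
    <job_list state="pending">
      <JB_job_number>103</JB_job_number>
      <JB_name>align_all</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </job_info>
</job_info>
"""


def pending_jobs(xml_text):
    """Return a list of dicts, one per pending job, in document order."""
    root = ET.fromstring(xml_text)
    # ".//" searches the whole tree; the predicate keeps pending jobs only.
    matches = root.findall(".//job_list[@state='pending']")
    return [
        {
            "number": job.findtext("JB_job_number"),
            "name": job.findtext("JB_name"),
            "owner": job.findtext("JB_owner"),
        }
        for job in matches
    ]


if __name__ == "__main__":
    for job in pending_jobs(SAMPLE_XML):
        print(job["number"], job["name"], job["owner"])
```

The running job is skipped entirely; the query does the filtering, so the surrounding code never has to inspect jobs it does not care about.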

The use of XPATH addresses one problem: “How do I wade through lots of XML and pick out the bits that I’m actually interested in?” This is only a partial solution, as one still has to do something interesting (or at least visually pleasing) with the selected XML data. This is where another W3C recommendation comes into play: XSLT 1.0.

XSLT is a language for transforming one XML document into another. The output need not be XML, however: a stylesheet can just as easily produce other formats, including plain text and simple HTML.
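To make the idea concrete, here is a minimal sketch of an XSLT 1.0 stylesheet that turns a made-up job listing into an HTML table. It is driven from Python with the third-party lxml library, which is an assumption on my part; any XSLT 1.0 processor should behave similarly. The element names other than job_list, the sample data, and the render_pending_html helper are illustrative and not taken from any real stylesheet.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical input. Element names other than <job_list> are assumptions.
SAMPLE_XML = """\
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_name>blast_run</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_name>hmmer_scan</JB_name>
  </job_list>
  <job_list state="pending">
    <JB_job_number>103</JB_job_number>
    <JB_name>align_all</JB_name>
  </job_list>
</job_info>
"""

# An illustrative stylesheet: one table row per pending job.
PENDING_TO_HTML = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="/">
    <table>
      <xsl:for-each select="//job_list[@state='pending']">
        <tr>
          <td><xsl:value-of select="JB_job_number"/></td>
          <td><xsl:value-of select="JB_name"/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
"""


def render_pending_html(xml_text):
    """Apply the stylesheet to xml_text and return the HTML as a string."""
    transform = etree.XSLT(etree.fromstring(PENDING_TO_HTML))
    result = transform(etree.fromstring(xml_text))
    return str(result)


if __name__ == "__main__":
    print(render_pending_html(SAMPLE_XML))
```

Switching the xsl:output method to "text" and replacing the table markup with plain strings would yield a text report from the same input, which is the flexibility described above.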

The combination of XPATH and XSLT revived the Grid Engine monitoring project and enabled it to make significant progress in a few short weeks of nights-and-weekends hacking. XPATH is used to pull interesting data from the large volume of raw XML generated by Grid Engine, and XSLT is the method by which the resulting XML is transformed into text, HTML, and even syndicated RSS feeds.

The process is surprisingly easy, and a rich and responsive Web interface for Grid Engine clusters was quickly developed. The hardest part of the entire process had nothing to do with XML — far more time was spent on Web and interface design issues involving CSS stylesheets, cross-browser display problems, and JavaScript than on the mechanics of pulling and manipulating XML data from Grid Engine.

Most impressive is the freedom XSLT provides to quickly create and change the output of the XML transformation. When initial feedback on the monitoring interface was received, consensus was that it was fine for small clusters, but an information-heavy interface would quickly become unusable on clusters with thousands of grid nodes and jobs. It took several hours to develop a new “terse” XSLT stylesheet customized for use on very large grid systems. When a second beta tester mentioned that it would “be cool to browse my jobs via the new RSS-aware Apple Safari Web browser...” it took just 30 minutes to create a prototype of an XSLT stylesheet capable of transforming Grid Engine XML into RSS 2.0-compliant output.
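For readers curious what such an RSS prototype might involve, the sketch below shows one possible shape. It is not the project's actual stylesheet: the element names other than job_list, the sample jobs, the example.org link, and the jobs_to_rss helper are all assumptions, and the third-party lxml library is used only as a convenient XSLT 1.0 processor.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical input. Element names other than <job_list> are assumptions.
SAMPLE_XML = """\
<job_info>
  <job_list state="running">
    <JB_job_number>101</JB_job_number>
    <JB_name>blast_run</JB_name>
    <JB_owner>alice</JB_owner>
  </job_list>
  <job_list state="pending">
    <JB_job_number>102</JB_job_number>
    <JB_name>hmmer_scan</JB_name>
    <JB_owner>bob</JB_owner>
  </job_list>
</job_info>
"""

# Illustrative stylesheet: one RSS <item> per job. The channel link is a
# placeholder, not a real monitoring URL.
JOBS_TO_RSS = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <rss version="2.0">
      <channel>
        <title>Grid job status</title>
        <link>http://example.org/grid-status</link>
        <description>Jobs currently known to the grid</description>
        <xsl:for-each select="//job_list">
          <item>
            <title>
              <xsl:value-of
                  select="concat('Job ', JB_job_number, ': ', JB_name)"/>
            </title>
            <description>
              <xsl:value-of
                  select="concat('Owner: ', JB_owner, ', state: ', @state)"/>
            </description>
          </item>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:stylesheet>
"""


def jobs_to_rss(xml_text):
    """Transform the job listing into an RSS 2.0 document (as a string)."""
    transform = etree.XSLT(etree.fromstring(JOBS_TO_RSS))
    result = transform(etree.fromstring(xml_text))
    return etree.tostring(result.getroot(), encoding="unicode",
                          pretty_print=True)


if __name__ == "__main__":
    print(jobs_to_rss(SAMPLE_XML))
```

Only the stylesheet differs from an HTML or text transformation; the input XML and the code that drives the processor stay the same, which helps explain why a new output format can be prototyped so quickly.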

The “xml-qstat” project is released under a Creative Commons license. Visit http://xml-qstat.bioteam.net for documentation, screenshots, all the source code, and links to live demonstration sites.

Even if you don’t use Grid Engine, feel free to browse the XSLT stylesheets to see how easy it is to turn XML into customized HTML, text, and RSS documents. Happy hacking!

Chris Dagdigian is a self-described infrastructure geek currently employed by The BioTeam. E-mail: chris@bioteam.net.
