Workflow Environments Guide


By Chris Dwan

Aug 15, 2005 | During our recent work with the Web services interface to iNquiry, BioTeam has gained familiarity with several graphical workflow packages for scientific computing. These tools have been gathering acceptance in bioinformatics, genomics, and general scientific computing groups from large pharmaceutical companies to single investigators.

I’ve compiled a short list of the features that I use to differentiate these offerings when selecting the one that is most appropriate for a particular user. As with many technology decisions, the choice of a workflow environment is seldom clear. Many factors must be weighed in the context of user requirements, local expertise, and required features.

The packages I’ve worked with are Taverna, a free, open-source workflow environment produced as part of the MyGrid project; InforSense and Scitegic’s Pipeline Pilot, both commercial products with robust features and enterprise-level support; and Apple’s Automator. Apple has built Web services capabilities into its Tiger operating system, and Automator is a way to access these services. Packages I have not yet had time to try are TurboWorx, the Broad Institute’s GenePattern, and VIBE from Incogen.

Features I use to differentiate between offerings are:

Support for basic programmatic constructs. While graphical environments will never replace traditional interpreted or compiled programs, they should still support the full range of language constructs required to implement arbitrary algorithms. This includes conditional execution (if/else), loops (do/while), and rudimentary variables. These features are absolutely essential to developing large, complex protocols.

Multiple inputs/outputs for modules. Useful modules consume multiple input streams and produce multiple output streams.

Failure handling. Developing workflows for a complex, heterogeneous, highly connected infrastructure requires what might be called defensive programming. Errors will inevitably occur outside the purview of the developer. Workflow environments need to provide easy access to underlying error codes and messages, as well as clear notification as to which steps in a process failed and need to be recomputed. A clean way to differentiate between transient and permanent errors would be a huge plus.
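
As a rough illustration of the transient/permanent distinction, here is a minimal Python sketch. The exception classes and the run_step helper are hypothetical names invented for this example; none of the packages discussed here is known to expose this interface. A step is retried only when it fails with a transient error, and the returned record says which step failed and why.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. bad input)."""


def run_step(name, action, max_attempts=3, delay_seconds=0.0):
    """Run one workflow step and return a record of what happened.

    The record names the step, gives a status of "ok" or "failed",
    counts the attempts made, and carries the underlying error message.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
            return {"step": name, "status": "ok", "attempts": attempt,
                    "result": result, "error": None}
        except TransientError as exc:
            # Worth another try: remember the message and wait briefly.
            last_error = str(exc)
            if attempt < max_attempts:
                time.sleep(delay_seconds)
        except PermanentError as exc:
            # Retrying cannot help, so report the failure immediately.
            return {"step": name, "status": "failed", "attempts": attempt,
                    "result": None, "error": str(exc)}
    # Every attempt hit a transient error.
    return {"step": name, "status": "failed", "attempts": max_attempts,
            "result": None, "error": last_error}
```

The failed records are what a workflow environment would need to surface so that the developer knows which steps to recompute.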

Cached results/partial reexecution. For me, at least, debugging requires running a process over and over again, working out the errors from beginning to end. The ability to selectively reexecute only those portions of a workflow that have changed, or that depend on changed modules, helps accelerate this process.
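
One way to picture partial reexecution is the Python sketch below. It is an illustration under stated assumptions, not how any of the packages above works: each step is a plain dictionary with hypothetical "needs", "params", and "run" keys, and the caller supplies the steps in dependency order. A step's cache key is a hash of its own parameters plus the keys of the steps it depends on, so changing one step invalidates that step and everything downstream of it, and nothing else.

```python
import hashlib
import json


def fingerprint(step, upstream_keys):
    """Hash a step's parameters together with its upstream cache keys."""
    payload = json.dumps({"params": step["params"], "upstream": upstream_keys},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_workflow(steps, order, cache):
    """Run steps in the given dependency order, reusing cached results.

    `cache` maps fingerprints to results and persists between runs.
    Returns (names of steps actually executed, results by step name).
    """
    keys = {}
    results = {}
    executed = []
    for name in order:
        step = steps[name]
        key = fingerprint(step, [keys[dep] for dep in step["needs"]])
        keys[name] = key
        if key in cache:
            # Nothing this step depends on has changed: reuse the result.
            results[name] = cache[key]
        else:
            inputs = [results[dep] for dep in step["needs"]]
            results[name] = step["run"](step["params"], inputs)
            cache[key] = results[name]
            executed.append(name)
    return executed, results
```

Note that this sketch only notices changes to parameters. A change to the code behind a step would also have to feed into the fingerprint for the cache to stay honest.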

User interaction/steering. Some processes (particularly those relevant to a bench scientist) require interaction and decision making in the middle. While it is simple enough to create N+1 workflows for a process with N user interactions, it is better to explicitly support user choice, input, and notification without stopping and restarting the entire pipeline. A very-high-level version of this would involve publishing process status notification via an RSS feed or similar technology. Of course, this would only encourage the Blackberry crowd to check their processes more frequently than they already do.

Ease of relocation. Perhaps the best part of Web services technology is the fact that services are explicitly virtualized. In theory, this means that workflows should be entirely portable. Workflow environments should make it simple to point a particular action at a different service provider. If I publish a workflow that points at a set of services on my cluster/database/grid then remote users should be able to redirect each call to their local resource with minimal effort.
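
A minimal Python sketch of what easy relocation could look like follows. It assumes the workflow refers to services by logical name and that a separate binding table maps each name to an endpoint; the bind_services helper, the step layout, and the example.org URLs are all hypothetical.

```python
def bind_services(workflow, bindings):
    """Return a copy of the workflow with logical service names resolved.

    `workflow` is a list of steps, each naming a logical "service".
    `bindings` maps each logical name to a concrete endpoint URL.
    """
    missing = sorted({step["service"] for step in workflow} - set(bindings))
    if missing:
        raise KeyError("no endpoint bound for: " + ", ".join(missing))
    return [dict(step, endpoint=bindings[step["service"]]) for step in workflow]


# The published workflow carries no host names at all.
workflow = [
    {"name": "search", "service": "blast"},
    {"name": "align", "service": "clustalw"},
]

# The author's bindings (hypothetical URLs).
author_bindings = {
    "blast": "http://cluster.example.org/services/blast?wsdl",
    "clustalw": "http://cluster.example.org/services/clustalw?wsdl",
}

# A remote user redirects every call by swapping one table.
local_bindings = {
    "blast": "http://localhost:8080/services/blast?wsdl",
    "clustalw": "http://localhost:8080/services/clustalw?wsdl",
}

local_workflow = bind_services(workflow, local_bindings)
```

Keeping the endpoints out of the workflow definition means the remote user edits one table rather than every step.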

Revision history. As workflows become part of the enterprise environment, they will need the same sort of revision control as any other document. For workflows saved as XML files, this can be simply implemented with a revision control system such as RCS, CVS, or SVN. Robust integration with the workflow environment itself is a big plus.

Command line execution. The emerging use model for workflows appears to be that expert developers will create protocols for use by others. This means that in many cases, the workflows themselves will be pieces in other automated systems. Therefore, they must support execution from the command line, which in turn allows automated or remote invocation.

Encapsulated scripting. No environment will ever provide every possible module. One of the most powerful features I’ve seen in any of these tools is the ability to very simply define a “script wrapper” action. Of course, this could lead to abuses of the environment such as wrapping an existing monolithic Perl script in a single action and declaring it a workflow.
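
To make the idea of a script wrapper concrete, here is a minimal Python sketch. The script_action function is a hypothetical name for this example and is not the interface of any package named above. It runs an external program as a single step, feeds the step's input to the program on standard input, and hands back whatever the program writes to standard output.

```python
import subprocess
import sys


def script_action(argv, input_text="", timeout_seconds=60):
    """Run an external script as one workflow step.

    Sends input_text to the script's stdin and returns its stdout.
    Raises RuntimeError, including stderr, if the exit code is non-zero.
    """
    completed = subprocess.run(
        argv,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for a real script: upper-case whatever arrives on stdin.
    one_liner = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    print(script_action([sys.executable, "-c", one_liner], "acgt"))
```

Passing the exit code and standard error back to the caller keeps a wrapped script from failing silently inside a larger workflow.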

Disconnect/reconnect. Production workflows must support long-running processes. In the extreme case, some pipelines will run perpetually, receiving new data from automated instruments. I simply cannot endorse any product that requires me to leave my laptop connected to the Internet for my jobs to run.

Process encapsulation. Both of the commercial offerings allow me to wrap up a set of calls into the equivalent of a subroutine and then to republish that subroutine as a Web service using WSDL and SOAP. This is absolutely imperative for many reasons, not least of which is the fact that the whole point of a graphical workflow system is to mitigate complexity and provide a clear and simple view of the process being implemented. When workflows require wall-sized posters to display, they no longer serve that purpose.

Parse WSDL; speak SOAP. This seems self-evident to me: Any new programmatic technology should make use of Web services and discoverable resources.

I’m certain that this is not an exhaustive list. These are just a few points that I’ve seen in a couple of months of working with the technology.

The compelling differentiator for me comes down to user expectations and needs. An academic lab with limited financial resources will find the free and open-source tools appealing. Corporations with enterprise-level computing needs tend to be willing to pay a premium for tools with support teams to back them up. The technology is still young and malleable enough that both groups will find plenty of opportunity to do great and interesting things, and these graphical environments provide a valuable addition to the scientific computing toolbox.

 

Chris Dwan is a senior consultant with The BioTeam. E-mail: cdwan@bioteam.net.
