Aug 15, 2005 | During our recent work with the Web services interface to iNquiry, BioTeam has gained familiarity with several graphical workflow packages for scientific computing. These tools have been gathering acceptance in bioinformatics, genomics, and general scientific computing groups from large pharmaceutical companies to single investigators.
I’ve compiled a short list of the features that I use to differentiate these offerings when selecting the one that is most appropriate for a particular user. As with many technology decisions, the choice of a workflow environment is seldom clear. Many factors must be weighed in the context of user requirements, local expertise, and required features.
The packages I’ve worked with are Taverna, a free, open-source workflow environment produced as part of the MyGrid project; InforSense; and Scitegic’s Pipeline Pilot, commercial products with robust features and enterprise-level support; and Apple’s Automator. Apple has built Web services capabilities into their Tiger operating system, and Automator is a way to access these services. Packages I simply have not yet had the time to try out are TurboWorx, the Broad Institute’s GenePattern, and VIBE from Incogen.
Features I use to differentiate between offerings are:
Support for basic programmatic constructs. While graphical environments will never replace traditional interpreted or compiled programs, they should still support the full range of language constructs required to implement arbitrary algorithms. This includes conditional execution (if/else), loops (do/while), and rudimentary variables. These features are absolutely essential to developing large, complex protocols.
Multiple inputs/outputs for modules. Useful modules produce multiple input and output streams.
Failure handling. Developing workflows for a complex, heterogeneous, highly connected infrastructure requires what might be called defensive programming. Errors will inevitably occur outside the purview of the developer. Workflow environments need to provide easy access to underlying error codes and messages, as well as clear notification as to which steps in a process failed and need to be recomputed. A clean way to differentiate between transient and permanent errors would be a huge plus.
Cached results/partial reexecution. For me, at least, debugging requires running a process over and over again, working out the errors from beginning to end. The ability to selectively reexecute those portions of a workflow that have changed or depend on those changed modules helps accelerate this process.
User interaction/steering. Some processes (particularly those relevant to a bench scientist) require interaction and decision making in the middle. While it is simple enough to create N+1 workflows for a process with N user interactions, it is better to explicitly support user choice, input, and notification without stopping and restarting the entire pipeline. A very-high-level version of this would involve publishing process status notification via an RSS feed or similar technology. Of course, this would only encourage the Blackberry crowd to check their processes more frequently than they already do.
Ease of relocation. Perhaps the best part of Web services technology is the fact that services are explicitly virtualized. In theory, this means that workflows should be entirely portable. Workflow environments should make it simple to point a particular action at a different service provider. If I publish a workflow that points at a set of services on my cluster/database/grid then remote users should be able to redirect each call to their local resource with minimal effort.
Revision history. As workflows become part of the enterprise environment, they will need the same sort of revision control as any other document. For workflows saved as XML files, this can be simply implemented with a revision control system such as RCS, CVS, or SVN. Robust integration with the workflow environment itself is a big plus.
Command line execution. The emerging-use model for workflows appears to be that expert developers will create protocols for use by others. This means that in many cases, the workflows themselves will be pieces in other automated systems. Therefore, they must support execution from the command line and thus automated or remote invocation.
Encapsulated scripting. No environment will ever provide every possible module. One of the most powerful features I’ve seen in any of these tools is the ability to very simply define a “script wrapper” action. Of course, this could lead to abuses of the environment such as wrapping an existing monolithic PERL script in a single action and declaring it a workflow.
Disconnect/reconnect. Production workflows must support long running processes. In the extreme case, some pipelines will run perpetually, receiving new data from automated instruments. I simply cannot endorse any product that requires me to leave my laptop connected to the Internet for my jobs to run.
Process encapsulation. Both of the commercial offerings allow me to wrap up a set of calls into the equivalent of a subroutine and then to republish that subroutine as a Web service using WSDL and SOAP. This is absolutely imperative for many reasons, not least of which is the fact that the whole point of a graphical workflow system is to mitigate complexity and provide a clear and simple view of the process being implemented. When workflows require wall-sized posters to display, they no longer serve that purpose.
Parse WSDL; speak SOAP. This seems self-evident to me: Any new programmatic technology should make use of Web services and discoverable resources.
I’m certain that this is not an exhaustive list. These are just a few points that I’ve seen in a couple of months of working with the technology.
The compelling differentiator for me comes down to user expectations and needs. An academic lab with limited financial resources will find the free and open-source tools appealing. Corporations with enterprise-level computing needs tend to be willing to pay a premium for tools with support teams to back them up. The technology is still young and malleable enough that both groups will find plenty of opportunity to do great and interesting things, and these graphical environments provide a valuable addition to the scientific computing toolbox.
Chris Dwan is a senior consultant with The BioTeam. E-mail: firstname.lastname@example.org.