Workflow Environments Guide


By Chris Dwan

Aug 15, 2005 | During our recent work with the Web services interface to iNquiry, BioTeam has gained familiarity with several graphical workflow packages for scientific computing. These tools have been gathering acceptance in bioinformatics, genomics, and general scientific computing groups from large pharmaceutical companies to single investigators.

I’ve compiled a short list of the features that I use to differentiate these offerings when selecting the one that is most appropriate for a particular user. As with many technology decisions, the choice of a workflow environment is seldom clear. Many factors must be weighed in the context of user requirements, local expertise, and required features.

The packages I’ve worked with are Taverna, a free, open-source workflow environment produced as part of the MyGrid project; InforSense and SciTegic’s Pipeline Pilot, commercial products with robust features and enterprise-level support; and Apple’s Automator. Apple has built Web services capabilities into its Tiger operating system, and Automator is a way to access these services. Packages I have not yet had time to try are TurboWorx, the Broad Institute’s GenePattern, and VIBE from Incogen.

Features I use to differentiate between offerings are:

Support for basic programmatic constructs. While graphical environments will never replace traditional interpreted or compiled programs, they should still support the full range of language constructs required to implement arbitrary algorithms. This includes conditional execution (if/else), loops (do/while), and rudimentary variables. These features are absolutely essential to developing large, complex protocols.

Multiple inputs/outputs for modules. Useful modules consume multiple input streams and produce multiple output streams.

Failure handling. Developing workflows for a complex, heterogeneous, highly connected infrastructure requires what might be called defensive programming. Errors will inevitably occur outside the purview of the developer. Workflow environments need to provide easy access to underlying error codes and messages, as well as clear notification as to which steps in a process failed and need to be recomputed. A clean way to differentiate between transient and permanent errors would be a huge plus.
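One way to picture the transient/permanent distinction is a step runner that retries the first kind and gives up immediately on the second. This is only a sketch in Python: the exception classes, the `run_step` helper, and the shape of the status record are all assumptions made for illustration, not features of any of the packages discussed here.

```python
import time


class TransientError(Exception):
    """A failure worth retrying, e.g. a timeout or a busy service."""


class PermanentError(Exception):
    """A failure that retrying will not fix, e.g. malformed input."""


def run_step(name, action, retries=3, delay=0.0):
    """Run one workflow step and return a status record instead of raising."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            result = action()
            return {"step": name, "ok": True, "result": result,
                    "attempts": attempt, "error": None}
        except TransientError as exc:
            # Remember the message and try again after a pause.
            last_error = str(exc)
            time.sleep(delay)
        except PermanentError as exc:
            # No point retrying: report the failure right away.
            return {"step": name, "ok": False, "result": None,
                    "attempts": attempt, "error": str(exc)}
    # Every attempt hit a transient error.
    return {"step": name, "ok": False, "result": None,
            "attempts": retries, "error": last_error}


# Hypothetical steps: one recovers on the third try, one never can.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("service busy")
    return "done"


def broken():
    raise PermanentError("malformed input")


flaky_record = run_step("search", flaky)
broken_record = run_step("parse", broken)
```

The records name the step that failed and carry the underlying message, which is the information a developer needs to decide what must be recomputed.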

Cached results/partial reexecution. For me, at least, debugging requires running a process over and over again, working out the errors from beginning to end. The ability to selectively reexecute those portions of a workflow that have changed or depend on those changed modules helps accelerate this process.
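Partial reexecution can be sketched as a cache keyed on each step's name, version, and inputs: a step runs again only when one of those changes, and a change in its output automatically invalidates whatever consumes it. The `CachedWorkflow` class and the toy sequence steps below are hypothetical, a minimal Python illustration rather than the mechanism any particular product uses.

```python
import hashlib
import json


class CachedWorkflow:
    """Runs steps, skipping any whose name, version, and inputs are unchanged."""

    def __init__(self):
        self.cache = {}     # cache key -> stored result
        self.executed = []  # names of steps that actually ran

    def run(self, name, func, *inputs, version=1):
        key_source = json.dumps([name, version, inputs], default=str)
        key = hashlib.sha256(key_source.encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = func(*inputs)
            self.executed.append(name)
        return self.cache[key]


def clean(seq):
    return seq.upper()


def trim(seq, length):
    return seq[:length]


def count_gc(seq):
    return seq.count("G") + seq.count("C")


def pipeline(wf, seq, length):
    cleaned = wf.run("clean", clean, seq)
    trimmed = wf.run("trim", trim, cleaned, length)
    return wf.run("count_gc", count_gc, trimmed)


wf = CachedWorkflow()
first = pipeline(wf, "acgtgc", 4)   # all three steps run
wf.executed.clear()
second = pipeline(wf, "acgtgc", 6)  # only trim and its dependent rerun
rerun = list(wf.executed)
```

Bumping `version` on a call models editing the module itself: the key changes, so that step reruns even when its inputs are identical.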

User interaction/steering. Some processes (particularly those relevant to a bench scientist) require interaction and decision making in the middle. While it is simple enough to create N+1 workflows for a process with N user interactions, it is better to explicitly support user choice, input, and notification without stopping and restarting the entire pipeline. A very-high-level version of this would involve publishing process status notification via an RSS feed or similar technology. Of course, this would only encourage the Blackberry crowd to check their processes more frequently than they already do.

Ease of relocation. Perhaps the best part of Web services technology is the fact that services are explicitly virtualized. In theory, this means that workflows should be entirely portable. Workflow environments should make it simple to point a particular action at a different service provider. If I publish a workflow that points at a set of services on my cluster/database/grid then remote users should be able to redirect each call to their local resource with minimal effort.
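In practice, relocation is easiest when a workflow refers to services by logical name and the mapping to concrete endpoints lives in one place. The Python sketch below assumes exactly that layout; the service names, the `example.org` URLs, and the `resolve` helper are invented for illustration.

```python
# Endpoints the workflow author published against (hypothetical URLs).
DEFAULT_ENDPOINTS = {
    "blast": "http://cluster.example.org/services/blast",
    "clustalw": "http://cluster.example.org/services/clustalw",
}

# The workflow names services logically and never embeds a URL.
WORKFLOW = [
    {"step": "search", "service": "blast"},
    {"step": "align", "service": "clustalw"},
]


def resolve(workflow, overrides=None):
    """Return a copy of the workflow with each step bound to an endpoint."""
    endpoints = {**DEFAULT_ENDPOINTS, **(overrides or {})}
    return [{**step, "endpoint": endpoints[step["service"]]}
            for step in workflow]


# A remote user redirects one call to a local resource.
local = resolve(WORKFLOW, {"blast": "http://localhost:8080/blast"})
```

The published workflow stays untouched; only the override table differs from site to site.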

Revision history. As workflows become part of the enterprise environment, they will need the same sort of revision control as any other document. For workflows saved as XML files, this can be simply implemented with a revision control system such as RCS, CVS, or SVN. Robust integration with the workflow environment itself is a big plus.

Command line execution. The emerging use model for workflows appears to be that expert developers will create protocols for use by others. This means that in many cases, the workflows themselves will be pieces in other automated systems. Therefore, they must support execution from the command line, which in turn allows automated or remote invocation.
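What an automated caller needs from a command-line entry point is modest: arguments in, machine-readable output, and an exit code that signals success or failure. The following Python sketch assumes a toy workflow (`run_workflow`, which reports GC content) purely to have something to invoke; none of the names come from a real product.

```python
import argparse
import json


def run_workflow(sequence, min_length):
    """Stand-in for a real workflow: validate a sequence, report GC content."""
    if not sequence or len(sequence) < min_length:
        raise ValueError(f"sequence shorter than {min_length} bases")
    gc = sum(sequence.upper().count(base) for base in "GC")
    return {"length": len(sequence), "gc_fraction": gc / len(sequence)}


def main(argv=None):
    """Parse arguments, run the workflow, and return a shell exit code."""
    parser = argparse.ArgumentParser(description="Run a workflow without the GUI.")
    parser.add_argument("sequence")
    parser.add_argument("--min-length", type=int, default=1)
    args = parser.parse_args(argv)
    try:
        print(json.dumps(run_workflow(args.sequence, args.min_length)))
    except ValueError as exc:
        # Non-zero exit lets a calling script or scheduler detect the failure.
        print(json.dumps({"error": str(exc)}))
        return 1
    return 0
```

Saved as a script with the usual `if __name__ == "__main__": sys.exit(main())` guard (plus `import sys`), this can be driven from cron, a job scheduler, or another workflow.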

Encapsulated scripting. No environment will ever provide every possible module. One of the most powerful features I’ve seen in any of these tools is the ability to very simply define a “script wrapper” action. Of course, this could lead to abuses of the environment such as wrapping an existing monolithic Perl script in a single action and declaring it a workflow.
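A script wrapper amounts to very little code: feed the step's input to an external program on standard input, capture standard output, and treat a non-zero exit status as a failure. The sketch below shows one way to do that in Python; the `script_action` helper is hypothetical, and the one-line reversal script stands in for whatever existing script needs wrapping.

```python
import subprocess
import sys


def script_action(command):
    """Wrap an external command as a workflow action: text in, text out."""
    def action(text):
        proc = subprocess.run(command, input=text,
                              capture_output=True, text=True)
        if proc.returncode != 0:
            raise RuntimeError(
                f"{command[0]} exited with {proc.returncode}: "
                f"{proc.stderr.strip()}")
        return proc.stdout
    return action


# Stand-in for an existing script: reverse whatever arrives on stdin.
reverse = script_action([
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read()[::-1])",
])
result = reverse("ACGT")
```

The same wrapper accepts any command list, so a Perl or shell script slots in by changing only the argument to `script_action`.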

Disconnect/reconnect. Production workflows must support long-running processes. In the extreme case, some pipelines will run perpetually, receiving new data from automated instruments. I simply cannot endorse any product that requires me to leave my laptop connected to the Internet for my jobs to run.

Process encapsulation. Both of the commercial offerings allow me to wrap up a set of calls into the equivalent of a subroutine and then to republish that subroutine as a Web service using WSDL and SOAP. This is absolutely imperative for many reasons, not least of which is the fact that the whole point of a graphical workflow system is to mitigate complexity and provide a clear and simple view of the process being implemented. When workflows require wall-sized posters to display, they no longer serve that purpose.

Parse WSDL; speak SOAP. This seems self-evident to me: Any new programmatic technology should make use of Web services and discoverable resources.

I’m certain that this is not an exhaustive list. These are just a few points that I’ve seen in a couple of months of working with the technology.

The compelling differentiator for me comes down to user expectations and needs. An academic lab with limited financial resources will find the free and open-source tools appealing. Corporations with enterprise-level computing needs tend to be willing to pay a premium for tools with support teams to back them up. The technology is still young and malleable enough that both groups will find plenty of opportunity to do great and interesting things, and these graphical environments provide a valuable addition to the scientific computing toolbox.

 

Chris Dwan is a senior consultant with The BioTeam. E-mail: cdwan@bioteam.net.
