Workflow Environments Guide



Aug 15, 2005 | During our recent work with the Web services interface to iNquiry, BioTeam has gained familiarity with several graphical workflow packages for scientific computing. These tools have been gathering acceptance in bioinformatics, genomics, and general scientific computing groups from large pharmaceutical companies to single investigators.

I’ve compiled a short list of the features that I use to differentiate these offerings when selecting the one that is most appropriate for a particular user. As with many technology decisions, the choice of a workflow environment is seldom clear. Many factors must be weighed in the context of user requirements, local expertise, and required features.

The packages I’ve worked with are Taverna, a free, open-source workflow environment produced as part of the MyGrid project; InforSense and Scitegic’s Pipeline Pilot, both commercial products with robust features and enterprise-level support; and Apple’s Automator. Apple has built Web services capabilities into its Tiger operating system, and Automator is a way to access these services. Packages I have not yet had the time to try out are TurboWorx, the Broad Institute’s GenePattern, and VIBE from Incogen.

Features I use to differentiate between offerings are:

Support for basic programmatic constructs. While graphical environments will never replace traditional interpreted or compiled programs, they should still support the full range of language constructs required to implement arbitrary algorithms. This includes conditional execution (if/else), loops (do/while), and rudimentary variables. These features are absolutely essential to developing large, complex protocols.

Multiple inputs/outputs for modules. Useful modules accept multiple input streams and produce multiple output streams.

Failure handling. Developing workflows for a complex, heterogeneous, highly connected infrastructure requires what might be called defensive programming. Errors will inevitably occur outside the purview of the developer. Workflow environments need to provide easy access to underlying error codes and messages, as well as clear notification as to which steps in a process failed and need to be recomputed. A clean way to differentiate between transient and permanent errors would be a huge plus.
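To make the transient/permanent distinction concrete, here is a minimal Python sketch of a step runner that retries only transient failures and reports which step failed and why. It is not taken from any of the packages above; TransientError, PermanentError, run_step, and the two sample steps are hypothetical names invented for illustration.

```python
import time


class TransientError(Exception):
    """A failure that may succeed on retry (for example, a timeout)."""


class PermanentError(Exception):
    """A failure that retrying cannot fix (for example, malformed input)."""


def run_step(name, action, retries=3, delay=0.0):
    """Run one workflow step, retrying only transient failures.

    Returns a (status, detail) pair so the caller can see which
    step failed and what the underlying message was.
    """
    last_error = f"{name}: not attempted"
    for attempt in range(1, retries + 1):
        try:
            return ("ok", action())
        except TransientError as err:
            # Worth another try: remember the message and loop again.
            last_error = f"{name}: transient failure on attempt {attempt}: {err}"
            time.sleep(delay)
        except PermanentError as err:
            # Retrying will not help: report immediately.
            return ("failed", f"{name}: permanent failure: {err}")
    return ("failed", last_error)


# Hypothetical steps: one fails twice and then succeeds, one always fails.
attempts = {"count": 0}


def flaky_search():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("service timed out")
    return "42 hits"


def bad_parse():
    raise PermanentError("malformed input file")


search_result = run_step("search", flaky_search)
parse_result = run_step("parse", bad_parse)
```

In this sketch the search step succeeds on its third attempt, while the parse step is reported as failed after a single attempt, with the step name and the original message preserved.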

Cached results/partial reexecution. For me, at least, debugging requires running a process over and over again, working out the errors from beginning to end. The ability to selectively reexecute those portions of a workflow that have changed or depend on those changed modules helps accelerate this process.
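The core of partial reexecution is deciding which steps are stale. A minimal Python sketch, assuming the workflow can be described as a mapping from each step to the steps it reads from, might look like this. The function steps_to_rerun and the five-step workflow are hypothetical and not drawn from any of the products discussed.

```python
def steps_to_rerun(dependencies, changed):
    """Return the changed steps plus everything downstream of them.

    dependencies maps each step name to the list of steps it reads from.
    Steps outside the returned set can reuse their cached results.
    """
    stale = set(changed)
    grew = True
    while grew:
        grew = False
        for step, inputs in dependencies.items():
            if step not in stale and any(i in stale for i in inputs):
                stale.add(step)
                grew = True
    return stale


# Hypothetical workflow: each step lists the steps it depends on.
workflow = {
    "fetch": [],
    "align": ["fetch"],
    "annotate": ["fetch"],
    "tree": ["align"],
    "report": ["tree", "annotate"],
}

stale = steps_to_rerun(workflow, {"align"})
```

Editing only the align step marks align, tree, and report as stale, so the cached results of fetch and annotate can be reused.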

User interaction/steering. Some processes (particularly those relevant to a bench scientist) require interaction and decision making in the middle. While it is simple enough to create N+1 workflows for a process with N user interactions, it is better to explicitly support user choice, input, and notification without stopping and restarting the entire pipeline. A very-high-level version of this would involve publishing process status notification via an RSS feed or similar technology. Of course, this would only encourage the Blackberry crowd to check their processes more frequently than they already do.

Ease of relocation. Perhaps the best part of Web services technology is the fact that services are explicitly virtualized. In theory, this means that workflows should be entirely portable. Workflow environments should make it simple to point a particular action at a different service provider. If I publish a workflow that points at a set of services on my cluster/database/grid then remote users should be able to redirect each call to their local resource with minimal effort.
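One way to picture this kind of relocation: the workflow refers to services by logical name, and a separate table maps those names to endpoints. The Python sketch below is only an illustration under that assumption. The service names, the URLs, and the resolve_endpoints helper are all hypothetical.

```python
# Endpoints published with the workflow (hypothetical URLs).
DEFAULT_ENDPOINTS = {
    "blast": "http://cluster.example.org/services/blast?wsdl",
    "clustalw": "http://cluster.example.org/services/clustalw?wsdl",
}


def resolve_endpoints(defaults, overrides):
    """Lay site-specific overrides over the published defaults.

    Rejects override names the workflow does not use, so a typo is
    caught before any remote call is made.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown service names: {sorted(unknown)}")
    return {**defaults, **overrides}


# A remote user redirects one call to a local resource.
local = resolve_endpoints(
    DEFAULT_ENDPOINTS,
    {"blast": "http://localhost:8080/blast?wsdl"},
)
```

Here the remote user changes a single entry; every other call keeps pointing at the published provider.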

Revision history. As workflows become part of the enterprise environment, they will need the same sort of revision control as any other document. For workflows saved as XML files, this can be simply implemented with a revision control system such as RCS, CVS, or SVN. Robust integration with the workflow environment itself is a big plus.

Command line execution. The emerging use model for workflows appears to be that expert developers will create protocols for use by others. This means that in many cases, the workflows themselves will be pieces in other automated systems. Therefore, they must support execution from the command line, which in turn makes automated or remote invocation possible.

Encapsulated scripting. No environment will ever provide every possible module. One of the most powerful features I’ve seen in any of these tools is the ability to very simply define a “script wrapper” action. Of course, this could lead to abuses of the environment such as wrapping an existing monolithic Perl script in a single action and declaring it a workflow.
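A script wrapper can be very small. The Python sketch below uses the standard subprocess module to run an external command as one step, passing text in on standard input and collecting standard output. The script_action function is a hypothetical name, and the one-line upper-casing command stands in for whatever existing script would actually be wrapped.

```python
import subprocess
import sys


def script_action(command, input_text=""):
    """Run an external script as one workflow step.

    Feeds input_text to the script's stdin and returns its stdout.
    Raises subprocess.CalledProcessError if the script exits non-zero,
    so the failure is visible to the surrounding workflow.
    """
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Stand-in for an existing script: upper-case whatever arrives on stdin.
upper = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
output = script_action(upper, "acgt\n")
```

The same wrapper would work for a Perl or shell script by changing the command list, which is exactly why it is easy to abuse.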

Disconnect/reconnect. Production workflows must support long running processes. In the extreme case, some pipelines will run perpetually, receiving new data from automated instruments. I simply cannot endorse any product that requires me to leave my laptop connected to the Internet for my jobs to run.

Process encapsulation. Both of the commercial offerings allow me to wrap up a set of calls into the equivalent of a subroutine and then to republish that subroutine as a Web service using WSDL and SOAP. This is absolutely imperative for many reasons, not least of which is the fact that the whole point of a graphical workflow system is to mitigate complexity and provide a clear and simple view of the process being implemented. When workflows require wall-sized posters to display, they no longer serve that purpose.

Parse WSDL; speak SOAP. This seems self-evident to me: Any new programmatic technology should make use of Web services and discoverable resources.

I’m certain that this is not an exhaustive list. These are just a few points that I’ve seen in a couple of months of working with the technology.

The compelling differentiator for me comes down to user expectations and needs. An academic lab with limited financial resources will find the free and open-source tools appealing. Corporations with enterprise-level computing needs tend to be willing to pay a premium for tools with support teams to back them up. The technology is still young and malleable enough that both groups will find plenty of opportunity to do great and interesting things, and these graphical environments provide a valuable addition to the scientific computing toolbox.

 

Chris Dwan is a senior consultant with The BioTeam. E-mail: cdwan@bioteam.net.


For reprints and/or copyright permission, please contact  Tim McLucas, (781) 972-1342, tmclucas@healthtech.com .