
The State of Bioinformatics Kits

Brian Osborne

March 24, 2009 | Inside the Box | A toolkit is what to us, the users? It’s that collection of software shortcuts that runs a biological idea through the processor. Familiar bioinformatics toolkits include BioJava, BioPerl, BioPython, and BioRuby. Their corresponding languages—Java, Perl, Python, and Ruby—know no science. But a kit knows chem- and bio- or any informatic that presents its face. These languages are not language-like. They are stiff—they reveal an odd love of a standard and change, if at all, with lurching and heavy labor. The kit is what the language is not—a route to an answer through code that reduces and conceals the language. The kit is clay to the language’s concrete.

There is some unknown relating the language and its kit. Emit the name of the language “PHP” close to the screen of one of your more inflammable friends and watch in delight as a derision reaction ignites in fumes and sparks in his brain. But that thing creates the MediaWiki kit which begat Wikipedia, the Wiki that begat a thousand Wikis. Our most correct language, Java, composes BioJava, but it is barbarous Perl that hacks out BioPerl, the kit that draws the most biologists to our fruitful addiction. Is that because PHP and Perl are, in their hearts, kits?

Claim to Fame

People working in and around open source notice that progress is frequently accomplished through heroism, preternatural efforts by the few on behalf of the many (or sometimes by and for the few). The kit is their claim to fame, their playground and their gallery, their blue sky. They weave with words, pulling the bit-net tighter until it’s all under their touch. So for that we thank the near-anonymous in this simple way: Sendu Bala and Chris Fields, of BioPerl, and Yaron Koren, from the Semantic Wiki world.

We are nothing if not disproportionate. We place each leaf in an ontology with care, but then these trees will dot the datascape, far from the massive data mounds. We decipher and notate the genomes but each in our own way, building detailed memorials to the DNA, with no roads between. Connecting all dots is not what we always do. We have created machines that write billions of nucleotides in a day, but will we leave meaning buried beneath? A million scintillants on their dark grids, times the number of phosphorylated flows. That times the number of precious samples and that times the highway, time. We’re now like the astronomers with their fourth dimension, measuring genomes in evolution, meta-genomes over seasons, genomes over lifespans. We’ve hit supernumerary.

Open Source & Ownership

But when you do dig deep in sequence you may find a treasure. A highlight at the recent Advances in Genome Biology and Technology conference (see p. 8) was the talk by a studious Canadian, Marco Marra. Neuroblastoma is a rare cancer, but we all know it because of its frequency in children, the fatalities, and the severity of treatment. Marra and his group wanted the full mutational details of its transcriptome, from cells highly enriched for the tumor initiators, as close to the cold events as you can get. But the expectation and the result were a profound mismatch. What they read was a list of changes in sequences with gene names that are only linked to B cell development, not neurons. This surprise completely transforms the story of the disease: the medicine will refocus on new therapies, with hopes of different outcomes.

Somehow if you can sift your haystack your way, you don’t worry about the needles. It is not just craft but the feel of collective ownership that pushes open source, a deep wish to create extraordinary function that all will use. One current thought is that we may want to wrap a new present, the “Next Generation” toolkit.

The Bio* toolkits were born when “single” was the norm: gene, interaction, protein, message, CPU. We will now create the 21st century kit, which will gyre vast hashes about their axes, all ids, terms, and data, cutting, intersecting and jetting off to methods to precisely annotate and detect. Or will we? This next one looks less like a standalone codebase than a knitting together of R and Bioconductor and Perl perhaps, or the BioLib project’s libraries, a kit of kits.

Can we count you in? There is an ebb and a flow in open source, one worker rests, then you want another to feel the brilliance. The apps are there, the languages are there, we wait to wrap our next gift.

Brian Osborne is a Principal Investigator at The BioTeam. He can be reached at

 This article also appeared in the March-April 2009 issue of Bio-IT World Magazine.
