Carole Goble: Democratizing Informatics for the ‘Long Tail’ Scientist

March 31, 2011

In 1979, Carole Goble went to the University of Manchester to study computer science, and never looked back. As professor of computer science, Goble is a popular and prodigious exponent of open source software, from knowledge information systems and databases to Semantic Web and data integration. She has helped create and/or develop a host of life science tools including Taverna, myExperiment, and BioCatalogue, all in the name of democratizing informatics. Goble recently spoke to Bio•IT World editor-in-chief Kevin Davies about her mission to democratize informatics for life sciences.

BIO•IT WORLD: Carole, how did you first get into computer science?

CAROLE GOBLE:
I began programming in high school in 1977 at age 16. I went to an all girls’ grammar school, so there was no thought that this was a gendered topic—and it wasn’t. Nobody knew what the hell it was, really. We had a rackety terminal behind the corner of the library, and I used to program on that! It only became something gendered that boys did in the 1980s, when computers went into the school. So I escaped all that. I only discovered it was a thing that boys were supposed to do when I went to my first class at university, and said, “Where are all the women?!” That was a shock.

I programmed using an acoustic coupler, with the old-fashioned phone headset and coding sheets. At university, where we had mainframes, you’d write the program on coding sheets and hand it in to a team of women who typed it up. You had a traffic light system, a queue, and you had 20 minutes to fix the bugs they’d introduced. It’s amazing we wrote anything! And you tell the young people that today...

And when did you get interested in the life sciences?

That came in the mid ‘90s. I joined the faculty to teach, and was mentored by Alan Rector (a medical informatician). He was extraordinary, a visionary, doing ontologies, controlled vocabularies, visually driven interfaces for clinical data collection, completely new models for how doctors should interact with patient data, and electronic records in the 1980s, when people laughed at him. Of course, he was proven right.

In the mid ’90s, I met a bioinformatician named Andy Brass. I started a project with him on data integration in life science data. He had PhD students busy stitching together data from various datasets, and he wanted a data integration system to help with that. So we started a whole program of knowledge-driven integration systems, ways to describe and mediate different datasets in order to get the answers. That was a system called TAMBIS, which turned out to be very influential, one of the first (and very few) ontology-driven mediation systems for life sciences data. It never went into commercial production, because it’s quite labor intensive and really expensive. This predates the Gene Ontology, etc. That’s why we were really ready for that kind of activity.

The next time I had a crack at this was in the UK’s e-Science program. I thought, OK we know about this Semantic Web stuff, I want to do workflows not databases, to help people deal with the heterogeneity and the mess. We’d describe the mess, and give people tools to stitch together datasets that managed the mess. We’d do much more mixing in the social interaction piece—we wouldn’t solve the problem just through technology. We’d solve data integration, method sharing, reproducibility in science, new ways of publishing results that put a primacy on methodology, automated methods for analytical pipelines, and incorporate the people aspect. We wanted to build ways of sharing know-how so you wouldn’t constantly reinvent the wheel. Bioinformatics suffers from a lot of reinvention.

That led to the myGrid project, which is now this big international activity, and it begat Taverna, a scientific workflow workbench and platform; and it begat a whole slew of e-laboratories, of which BioCatalogue and myExperiment are examples. We also have Methodbox for sharing datasets in social sciences, especially obesity, and we’ve built a systems biology sharing platform (SEEK) for microorganisms across Europe. That’s all built on the notion of automated pipelines and collaboration across scientists to share their assets and methods, and to crowd curate and collaborate—not necessarily synchronously, but asynchronously, by pooling know-how. This is all driven around my research in social computing, workflow technologies and Semantic Web technologies. Because basically, the way you handle this is through metadata, from scruffy tagging to rich ontologies.

What I particularly love doing is taking research ideas from computing and applying them. I find it difficult to do things in the theoretical sphere. I’m much more interested in how to put things into practice.

How big is your research group in Manchester?

The Information Management group, our mother group, is about 60-70 researchers. The myGrid project has about 25-30 people at any one time and team members in other universities too, such as Oxford and Newcastle in the UK and Leiden in the Netherlands. At Manchester, I have 17 postdocs. It’s very postdoc-heavy, which is possibly unusual, certainly in computer science, because two thirds of them are software engineers, not computer science researchers, using very agile methods to build the software. Then I have computer science researchers and four bioinformaticians. I also have astronomers and social scientists to help us build the social collaboration environment and our pipelining tools and digital libraries, which are using our tools for understanding how to preserve experimental method. How do you preserve workflows, tools, and datasets? I have a project partnered with SAGE Bionetworks: what does it mean to cite a network as you’re building a model using a computational workflow? How do you do attribution and credit management? We are addressing new models of scholarship.

We’re just starting a project with the National Libraries in Europe, including the British Library, to use our tools for preservation pipelines for documents—basically image processing. I don’t really care what it is, it’s just data!

Are you a real open-source evangelist?

The reason we went open source is, I was of the opinion that we were funded quite well in the e-Science program, and I thought it important that the results of public investment be made public. Open source doesn’t mean you can’t make money out of it. It’s still exploitable.

Secondly, I was really driven to this because I saw my bioinformatics/life science colleagues basically cutting and pasting from web sites and acting like machines to do these pipelines manually, because they had no other way of doing it. Any tools available to them were out of their price bracket. I wanted to help not so much ‘Big Science’ but the small laboratory, the ‘Long Tail’ scientist. If you think about it, they’re all to do with pooling the long tail for the benefit of the greater good, trying to get people who don’t know one another and don’t have amazing resources available to them, but have good research ideas. So the software had to be open source.

Then there’s the sustainability route: with open source, you can begin to create a pool of activity internationally and make it available for universities to incorporate into their training programs. So you have an education piece as well and a potential community to help you improve and sustain it. One of the things we’ve been doing with Taverna is trying to integrate third-party applications into the software. We’re establishing the Taverna Foundation that will manage the open development of the software.

How does Taverna relate to the commercial workflow tools from companies such as Accelrys and InforSense?

[InforSense founder] Yike Guo is a friend of mine—in fact, I was the external examiner for the PhD of Patrick Wendel, one of the core designers of InforSense’s platforms! It’s similar to Pipeline Pilot, taking services, components, and stitching them together into pipelines. The difference between Taverna and the other two is that we took a view that we’d take on any kind of Web service natively, without a priori forcing that service to go through a wrapping process in order to be incorporated into the workflow... The activation energy to incorporate your favorite tool into Taverna was very low.

The cost of that is heterogeneity between the different services. If the services are defined as a family, it all works beautifully; if not, then you’ve got to do some massaging. In the commercial tools that massaging is built into the components (or nodes), but it’s more exposed in Taverna. We’re building those massaging elements (or shims) so we can hide them, or wrap them into the components...

With Galaxy, for example, you have to wait for somebody to incorporate a tool into the framework before you can use it. But with Taverna, you don’t have to do that. We also have a very open platform for incorporating activities (the steps in the workflow). There’s a plug-in framework, so the different families of services—BioMart or BioMoby or Grid services—each of which has a certain conversation style with its services, can all be supported. That all gets hidden so you don’t have to worry about it.
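As a rough illustration of what a shim does (this is not Taverna code—Taverna workflows are built in its own workbench, and the service formats here are hypothetical), a shim is just a small adapter step that massages one service’s output into the shape the next service expects:

```python
# Hypothetical shim: adapt FASTA-formatted output from one service into the
# plain list of identifiers a downstream service expects. Not Taverna code;
# purely an illustration of the "massaging" step described above.

def fasta_to_id_list(fasta_text: str) -> list[str]:
    """Extract bare sequence identifiers from FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep the first token of the header line, without the '>' prefix.
            ids.append(line[1:].split()[0])
    return ids

if __name__ == "__main__":
    upstream_output = ">P12345 example protein\nMKTAYIAKQR\n>Q67890 another\nGGLMWA\n"
    print(fasta_to_id_list(upstream_output))  # ['P12345', 'Q67890']
```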

How widely used is Taverna?

It’s been available in downloadable form since 2005, but seriously taken up since 2006, when we became part of the Open Middleware Infrastructure Institute and replaced the original research team with a software engineering team. We now have two versions of Taverna: the earlier series 1, which is primarily desktop, and a completely re-engineered Taverna 2 series that runs on desktops and servers, handles large-scale data and workflows, and is an extensible platform, based on OSGi, for applications to integrate with more easily. Right now we have around 400 research organizations around the world using Taverna, including the caBIG cancer project in the US. That’s about 1,000 scientists using the tool at any one time.

My personal favorite application is a study of the trypanosomiasis (sleeping sickness) parasite in cattle. I like this project because it’s based in Nairobi, Liverpool and Manchester, a nice mix of biologists and informaticians, investigating why certain cattle are resistant to this blood parasite. It’s a mixture of ’omics—QTL data, microarray data, mining the literature, metabolic pathways, a whole series of workflows and pipelines that process and annotate every new gene linked to this disease.

The first set of workflows, built by Paul Fisher (a PhD student at the time) using public datasets and locally collected data, revealed a gene related to cell death and found unexpected relationships between cholesterol synthesis and disease resistance... What’s really interesting is that we’ve since deployed workflows on the Amazon Cloud that are able to integrate with those same workflows. Our partners in Liverpool are taking delivery of whole genomes. By using this Cloud workflow, we’ve put next-gen sequencing analysis and post-assembly annotation pipelines into the Cloud. Now we can do complete annotation of an entire genome in a few hours. What’s really cool is, it’s within the financial reach of researchers—a few dollars [for Amazon; the software is free]. We’re sending a group to train our colleagues in South Africa who can’t afford Pipeline Pilot.

Is Taverna used in industry or biotech at all?

That’s interesting—some commercial companies do use it, particularly small biotechs. Anecdotally, we know some larger drug companies have used it, for instance to build prototype pipelines in the early part of pharma discovery... We have a partnership with Eagle Genomics for Taverna’s commercial support. We’re putting together the next-gen sequencing work we’ve been piloting with our UK and African colleagues to offer that as a commercial service, as a workflow-based application in the Cloud.

We’ve been talking about the Semantic Web in life sciences for a good five years. Where are we with that?

I was in Semantic Web from the beginning, before it was called Semantic Web! I was the founding editor-in-chief of the Journal of Web Semantics. Semantic Web is the elephant to a blind man—everything to everybody.

One of the prime pieces of the Semantic Web is the notion that you could publish, into a common model, facts and information that were previously available only in a person-readable form. You’d be able to crosslink and query across these different information resources so machines could manipulate and find information and connect people in new ways. That’s what linked open data is: the original idea was, here’s a Web page, I’ll annotate it, and those facts will be pooled into this knowledge web or information space, which will be used as a platform for doing all sorts of applications. What was underplayed too much in the emerging Semantic Web was that you could do this with datasets—forget web pages, just push datasets out in a common data model, let’s call that ‘RDF’ [the Resource Description Framework]. Then if we had some common identifiers and common vocabularies, we could begin to build bridges between datasets. Imagine the London Underground map—that was the vision, right? But somewhere along the line it got migrated into automated reasoning and AI and inference and rich ontology models, which are great, but miss the point of basically indexing and linking.

Linked open data is a return to publishing, indexing, and linking. This is largely what you want to do—I want to find a connection between two datasets, I want to do some aggregation for multiple datasets around a particular protein or assay or scientist. For this, you have to focus on identity and adoption of common vocabularies. The adoption is more important than the complexity of the vocabularies, because you’re relying on the ubiquity of the terms. It’s just enough, just in time, not just in case.
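A minimal sketch of that “shared identifiers plus shared vocabularies” idea, written here with the Python rdflib library (the datasets, URIs and properties are invented for illustration and are not any particular project’s schema):

```python
# Two tiny "datasets" published as RDF. Because both reuse the same protein
# identifier (a UniProt-style URI), a single query can bridge them.
# Illustrative only: the example.org URIs and properties are made up.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

UNIPROT = Namespace("http://purl.uniprot.org/uniprot/")
ASSAYS = Namespace("http://example.org/assays/")
PATHWAYS = Namespace("http://example.org/pathways/")
VOCAB = Namespace("http://example.org/vocab/")

g = Graph()

# Dataset A: an assay record pointing at a protein by its shared identifier.
g.add((ASSAYS["assay42"], RDFS.label, Literal("kinase inhibition assay")))
g.add((ASSAYS["assay42"], VOCAB["measuresProtein"], UNIPROT["P12345"]))

# Dataset B: a pathway record referring to the same protein URI.
g.add((PATHWAYS["pathway7"], RDFS.label, Literal("cholesterol biosynthesis")))
g.add((PATHWAYS["pathway7"], VOCAB["involvesProtein"], UNIPROT["P12345"]))

# The "bridge" falls out of the shared identifier: find assay/pathway pairs
# that touch the same protein.
query = """
SELECT ?assay ?pathway WHERE {
    ?assay   <http://example.org/vocab/measuresProtein> ?protein .
    ?pathway <http://example.org/vocab/involvesProtein> ?protein .
}
"""
for row in g.query(query):
    print(row.assay, "is linked to", row.pathway)
```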

When we return to that, you see it makes some sense—a return to the ideas of the 1990s: how do we do data integration using some descriptions? So now we’re on track, but there are a few fundamental issues we have to sort out:

1) We have to sort out the infrastructure. We need scalable tools to handle large amounts of data. We’ve built trillions of facts that are readily integratable. I don’t ever want to see an RDF triple—I want to see a genome browser that happens to be powered by RDF triples.

2) We need ubiquity in publishing using some identifiers and some common concepts so we can do some linking. The emphasis is on adoption.

3) We need a primary mechanism of dealing with provenance—where did this thing come from? Amazingly, that wasn’t considered a prime piece of the Semantic Web originally. There is no infrastructure for provenance and versioning. That’s pretty damn important. We absolutely have to sort that out.

The Concept Web Alliance, led by Barend Mons at Leiden, is a group of biologists who say, instead of just publishing papers, let’s build the facts and information that we all know, buried in the literature, into a Semantic Web knowledgebase. But it has to have provenance, versioning, and it has to have the notion of mapping different identity worlds. We’re currently negotiating a European Innovative Medicines Initiative project [AstraZeneca’s Ian Dix discussed that at Bio-IT World Europe in 2010] to tackle these issues.

Is the Semantic Web being utilized as much as it could in life science organizations?

No, but I think of the Semantic Web as infrastructure. I don’t know how the power works to power my laptop or how my mobile phone reaches Shanghai, but I know how to use the phone. I don’t know how some databases work when I’m browsing the Web, but someone has built an application so I can use it. If we reveal the machinery of the Semantic Web, we’re missing the point. Bioinformaticians may choose to understand it; some may choose not to. But a biologist? No.

Here’s an example: our systems biology e-Laboratory—what do they use for data entry? Spreadsheets. We built a tool called RightField to browse the biology ontologies and wire the ontological terms into the cells of the spreadsheet, so when we ship it to an experimentalist, they only see a spreadsheet with dropdown boxes. What they’re actually doing is annotating their data with ontologies and identifiers that can be put into a linked data pool. But they don’t know that! All they get is a spreadsheet with drop-down boxes. That is what you need.
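The general idea can be sketched roughly like this (RightField itself is a separate tool; the snippet below just mimics the concept with the openpyxl library, and the “ontology” terms are stand-ins):

```python
# Embed a controlled vocabulary as a dropdown in a spreadsheet column, so the
# experimentalist just picks a term. Illustrative only; not RightField code.
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

terms = ["liver", "kidney", "heart"]  # stand-ins for ontology term labels

wb = Workbook()
ws = wb.active
ws["A1"] = "Sample ID"
ws["B1"] = "Tissue (controlled term)"

# Restrict column B to the controlled list; users only see a dropdown.
dv = DataValidation(type="list", formula1='"' + ",".join(terms) + '"', allow_blank=True)
ws.add_data_validation(dv)
dv.add("B2:B100")

wb.save("annotated_template.xlsx")
```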

Tell us a bit about your other open-source tools, myExperiment and BioCatalogue.

myExperiment is a place to find and share scientific workflows. BioCatalogue is a curated catalogue of public Web services for the life sciences. myExperiment has nearly 4,500 members who have deposited over 1,600 workflows from 20 different systems. BioCatalogue has over 1,700 services from 142 service providers.

They’re both working in the world of social computing, sharing for reuse and crowd sourcing, trying to leverage the Web 2.0/Facebook generation in order to create social networks for science. An important issue is understanding that sharing scientific assets and collaborating is not the same as sharing pictures or your favorite football team. There’s credit and attribution. There’s ‘sharing creep’: I can share with you and him, but not with my competitors, and with my colleagues only when I’m ready. It’s about understanding what it means to share and open up valuable assets before and after they’re published, and it has to fit with how people collaborate in a dog-eat-dog scientific world. BioCatalogue is about what it means to curate: what are the incentives to contribute to and co-curate these systems, particularly to curate someone else’s stuff?

It’s social computing and semantic computing, building links between people, and grouping things together into ‘research objects’. Research objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. I’m no longer thinking about publishing a workflow but publishing a research object—the workflow with its input data and its history, linked to a PDF of the paper, linked to a video, etc. These are steps towards scientific reproducibility, and the reuse of methods and results. We have another European Union funded project, Wf4Ever, that is examining these issues.
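As a loose sketch of what a research object aggregates (the field names below are invented for illustration and are not the formal Research Object model):

```python
# A research object bundles a workflow with its inputs, provenance and related
# publications into one citable unit. Hypothetical structure for illustration.
from dataclasses import dataclass, field

@dataclass
class ResearchObject:
    title: str
    workflow_uri: str                                     # e.g. a myExperiment workflow URL
    input_datasets: list = field(default_factory=list)    # data the workflow consumed
    provenance: dict = field(default_factory=dict)        # who ran it, when, with what versions
    related_papers: list = field(default_factory=list)    # DOIs of linked publications

ro = ResearchObject(
    title="Candidate-gene annotation run",
    workflow_uri="http://www.myexperiment.org/workflows/...",  # placeholder
    input_datasets=["QTL_regions.tsv", "microarray_expression.tsv"],
    provenance={"run_by": "example user", "date": "2011-03-01"},
    related_papers=["doi:10.xxxx/example"],
)
print(ro.title, "aggregates", len(ro.input_datasets), "datasets")
```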

What are some of your future goals?

My goal is to bring all these things together, my original vision of an end-to-end platform, the tools people use to do their science right the way through to a publishing mechanism. To finish what I started, really! To get adoption and make this all sustainable, so you can rely on it. I spend a lot of time getting funding, not research funding but to shore up the main platform(s). On the research side, it’s building the Semantic Web of biology. That’s my new project, starting in March. Let’s build a Semantic Web to support drug discovery. It’s an EU IMI project called OpenPHACTS, working with eight pharmaceutical companies and the Royal Society of Chemistry as well as a host of European research centers.

What about the challenge of medical informatics, interpreting the deluge of next-gen sequencing data?

Oh yes. That’s what I’m partnering with Eagle Genomics on. If you can build these post-assembly annotation pipelines, and then run them to get a whole genome—we’ve been working with cows, not people, but you can do an entire genome annotated with SNPs and biomarkers in six hours. The time taken for the annotation really depends on how much resource you throw at it, of course. The runs we’ve done so far using the Amazon Cloud have come in under two hours for 12 million SNPs.

So you can sequence, assemble and annotate a genome in a day, but how you interpret it is another matter! That makes a big difference. You’re no longer guessing. This becomes industrial scale. At Bio-IT World Europe, Marjolein Kriek (University of Leiden) talked about the fact that we’re just on the edge of being able to do the data processing. But interpretation is another thing altogether.

So I’m putting Taverna into the Cloud, and you can run these workflows in the Cloud and wire them into your application. If you have a particular kind of tool or algorithm—the heavy lifting of genome annotation—you send it to a service, just like Salesforce or Google Docs. You don’t even do the infrastructure or have a bioinformatics group—there’s a global bioinformatics group. We’re harnessing the long tail of bioinformatics expertise, putting it in the Cloud, so anyone can try out all sorts of experiments. We now have the infrastructure to democratize the informatics. •