IMAGING INFORMATICS
By John Russell
Oct. 23, 2008 | OK, the gratuitous use of BLAST here will probably make Eugene (Gene) Myers cringe – not that the Howard Hughes Medical Institute investigator doesn’t want to repeat his past success, this time in imaging informatics. Myers and distinguished colleagues at the NIH invented BLAST to cope with the growing flood of sequence data that was then befuddling researchers. He later served as VP Informatics Research at Celera and spearheaded development of its whole-genome shotgun sequencing protocol and algorithms.
Tough acts to follow, for sure. Today, Myers is again wading into a data flood, but instead of dealing with the puny 3.1 billion base pairs of human DNA, he is tackling the 4.2 trillion voxels that imaging a mouse’s brain will produce – in just one week of the project. Affordable and improved imaging technology is promising to light up the molecular landscape of living systems with far-reaching impact on basic research, drug development, and in the clinic. Myers spoke with Predictive Biomedicine editor John Russell about some of the opportunities and challenges imaging and imaging informatics present.
JR: Maybe you could start with why imaging informatics is becoming more important.
Myers: The imaging stuff is really just that. It’s about microscopy; it’s about what you can see in a microscope at various levels of resolution, either with a light microscope or an electron microscope (EM). A lot of the resurgence, frankly, is with respect to light as opposed to EM, although there are some interesting developments in terms of EM and increasing the throughput with which you can collect data.
We’ve had microscopy forever, right, but what’s changed technologically are a number of factors. One is that you can capture the information digitally. We need to appreciate that. CCD (charge-coupled device) detectors that are cheap and high resolution and sensitive are a recent development. We’ve only had really great ones for the last five or so years.
The other development that’s been very important is, how should one say it, the development of genetic fluorophores that can be expressed and produce fluorescence, and the fact that we have the entire genomes of many organisms, so that we can make any part of the genome glow, and any protein that gets produced by a genome we can make glow.
So now, we have the opportunity to observe in vivo, in situ, in the cell directly, the expression of genes. This is much different than the chips [microarrays] which just give you a readout for how much there is in some gemisch [mixture] of cells. Here you can go into individual cells and see where it is, and see how much of it there is, and what the distribution is. This is much higher-dimensional data, and qualitatively much more interesting information. Of course it’s harder to get, but it is getting progressively easier to do this.
JR: Are many people actually doing that?
Myers: This is going on all over the place. We have the ‘recombineering’ now, so we have ways of producing all these constructs. My friend Tony Hyman did one of the first global surveys with light of cell division in C. elegans. That was several years ago. Erin O’Shea, who I believe is now at Harvard, did a genome-wide screen of the expression of proteins in yeast. We’re now doing the same thing with the fly here at Janelia [Farm] in terms of the brain, and at Berkeley they are doing the rest of the genome. So these kinds of things are going on and everybody understands it.
So it’s being able to capture the information digitally, it’s being able to literally illuminate molecularly the cells. Basically label things based on molecular markers. And until recently, we may have had GFP [green fluorescent protein] but we couldn’t put the GFP anyplace we wanted to until we had the genomes, right? Really we’ve only had these genomes since 2001. Finally, I think another thing that can’t be overlooked is there are a lot of very interesting developments in terms of optical physics that we can do with light [such as] structured illumination and spinning disk confocal technology. These are all recent new uses of light.
JR: What are some of the challenges in terms of handling and interpreting all this data?
Myers: That’s exactly the point. People have the ability to generate lots of data but they don’t have the software infrastructure for handling it. Basically people are rolling their own and in a lot of cases they’re asking for help. It’s just the way it was with genomics data in the ’80s, where it was, “holy cow, how do I do anything, how do I compare these sequences, what can I do with these things? Can you tell me if this protein is like that protein?” It was basically an area where the technology was coming fast and the software wasn’t there.
That’s kind of why I’ve entered the arena. There are new computational challenges and problems, some of which can be solved in part by using existing methods in the imaging literature but they are not solved adequately by those techniques; moreover nobody is in a position to deploy these things in the contexts that are arising.
So I kind of placed my bet. I’ve sat myself down right next to the biologist because my experience is that’s really the only way that it works. If you try to do it remotely it doesn’t work very well. You have to really immerse yourself in the pipelines that are producing the data.
JR: Would you expect tools and algorithms to emerge from your work, and would you make those available as open source?
Myers: That’s the plan. I hope we come up with some home runs – like BLAST.
JR: How far along is the work?
Myers: It’s a little bit harder in this case. I’m having more trouble finding the killer app, but I’m convinced that a couple of killer apps will ultimately emerge. One thing that’s not 100 percent certain is [whether] this business is going to get as big as the genomics business. I think it’s going to get very big but I don’t know if it’s going to get as big or as popular because it is somewhat more technology-intensive and cross-disciplinary.
JR: You mentioned in Toronto (ISMB 2008 keynote, July 2008) that you thought imaging data would become the driver in biomedical research and would overwhelm, at least in quantity, genomics data.
Myers: Well I think it will be a bigger driver of the development of new knowledge, new insights into how cells are working, how proteins are interacting and what they are doing in cells and systems. That’s my bet. It’s going to create opportunities that don’t otherwise exist and create knowledge that it would be very, very hard to get by looking at an array of numbers that tell you kind of generally how [gene] expression went up or down in a collection of thousands of cells.
I think a huge amount of knowledge is going to be generated. I don’t want to say there won’t be important things learned by biochemistry or other methods that are currently in use. We are learning a great deal. I just think that this is going to become a major source of information.
JR: I was thinking about some of the IT challenges and thinking back to Celera where you had all these HP machines humming away in the background trying to crunch through the data. What are the IT challenges here?
Myers: We’re going to generate data sets that are larger than the Celera data set. You may remember that I gave three levels of the problem [speaking in Toronto]. One class of work is looking inside the cell, to see what proteins and what various elements are in the cell; the other level is looking at collections of cells to see how they are organizing themselves and understand what kind of cell types you have and how are those cell types interacting. Finally the third level is to actually be interpreting video or other imaging data that indicates the behavior of the resulting systems like a mouse moving or a fly flying.
For example one thing is this idea of whole brain imaging in the mouse. That data set, for one brain, is going to be 4.2 trillion voxels. OK, so the human genome was 3 billion. The total amount of data in one of these brain data sets – and we will collect it in less than one week if we hit our target – is 4.2 trillion (terabytes) of raw data that needs to be interpreted. That’s a big number.
How are we going to interpret that? So yeah, we’ve got our Beowulf cluster, a big one, downstairs. And the unusual thing about our cluster, unlike say a Google-type cluster, is that our machines are very high memory. We have more expensive machines and the reason for that is that images are very large and you want to operate on large 3-dimensional arrays. It’s really kind of a convenience; we’re buying our way out of a hole rather than really struggling with it. So giving ourselves the ability to handle large object instances by being able to accommodate large memory is a good idea. A lot of our things can be done in linear sweeps, but there’s a lot of data. It’s three dimensions and treating the boundaries on a three-dimensional grid is very difficult.
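To make the boundary problem concrete: one common way to process a volume that does not fit in memory is to cut it into blocks and read each block with a thin "halo" of neighboring voxels, so that a local operation sees the same data at a block edge as it would in the interior. The sketch below is only an illustration of that general idea, not a description of the Janelia pipeline; the function names, the volume shape, the block size and the halo width are all hypothetical.

```python
from itertools import product


def axis_blocks(length, block, halo):
    """Yield (core_start, core_stop, read_start, read_stop) along one axis.

    Core ranges tile [0, length) exactly once. Read ranges extend each
    core range by `halo` voxels per side, clipped at the volume edge.
    """
    for start in range(0, length, block):
        stop = min(start + block, length)
        yield start, stop, max(start - halo, 0), min(stop + halo, length)


def volume_blocks(shape, block, halo):
    """Yield (core, read) pairs of per-axis (start, stop) ranges for a 3-D grid."""
    per_axis = [list(axis_blocks(n, b, halo)) for n, b in zip(shape, block)]
    for combo in product(*per_axis):
        core = tuple((c[0], c[1]) for c in combo)
        read = tuple((c[2], c[3]) for c in combo)
        yield core, read


def voxel_count(ranges):
    """Number of voxels in a box given as per-axis (start, stop) ranges."""
    n = 1
    for start, stop in ranges:
        n *= stop - start
    return n


# Hypothetical numbers, chosen only to keep the example small.
shape = (1000, 800, 600)
block = (256, 256, 256)
halo = 8

blocks = list(volume_blocks(shape, block, halo))
core_total = sum(voxel_count(core) for core, _ in blocks)
read_total = sum(voxel_count(read) for _, read in blocks)
print("blocks:", len(blocks))
print("voxels written once:", core_total)
print("read overhead factor:", read_total / core_total)
```

Each block is read with its halo but only its core is written back, so every voxel is produced exactly once. The cost is the extra reading reported as the overhead factor, which grows as blocks get smaller; large-memory machines allow larger blocks and so less of this bookkeeping.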
Of course in 1998 that’s what we did at Celera too – we tended to buy our way out, and at that time we were buying a 64-gigabyte memory, which was actually one of the largest commercial memories you could buy. Now we routinely have quite a few processors with those big memories on them. The smallest memory on any of our machines is 8 gigs.
The other aspect from an IT perspective – and I don’t think this is news – but it’s clear that moving the data is a real bottleneck. So we’re talking about just to do a compute on this thing, you’ve got to get 4.2 terabytes out of the disk system to the various processors. So you have to move huge volumes and so distributed file systems are very important to use. We have one here and we use it.
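A rough back-of-the-envelope calculation shows why moving the data dominates. The voxel and base-pair counts below are the ones quoted in the interview; the storage cost of one byte per voxel (which matches the 4.2-terabyte figure) and the aggregate read bandwidth of 1 GB/s are assumptions made purely for illustration.

```python
# Figures quoted in the interview.
VOXELS_PER_BRAIN = 4.2e12      # one whole mouse brain
HUMAN_GENOME_BASES = 3.0e9     # "the human genome was 3 billion"

# Assumptions for illustration only (not stated in the interview).
BYTES_PER_VOXEL = 1                 # 8-bit intensity per voxel
READ_BANDWIDTH_BYTES_PER_S = 1e9    # hypothetical aggregate 1 GB/s from disk


def raw_bytes(voxels, bytes_per_voxel=BYTES_PER_VOXEL):
    """Raw size of an image volume in bytes."""
    return voxels * bytes_per_voxel


def sweep_seconds(total_bytes, bandwidth=READ_BANDWIDTH_BYTES_PER_S):
    """Time to stream the data once from storage to the processors."""
    return total_bytes / bandwidth


ratio = VOXELS_PER_BRAIN / HUMAN_GENOME_BASES
size_tb = raw_bytes(VOXELS_PER_BRAIN) / 1e12
minutes = sweep_seconds(raw_bytes(VOXELS_PER_BRAIN)) / 60

print(f"voxels per brain vs. bases per genome: {ratio:.0f}x")
print(f"raw size at {BYTES_PER_VOXEL} byte/voxel: {size_tb:.1f} TB")
print(f"one linear sweep at assumed bandwidth: {minutes:.0f} minutes")
```

Under these assumptions a single pass over one brain takes about 70 minutes before any computation happens, and every additional pass or additional byte per voxel scales that figure linearly.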
JR: Will you make tools available through open source or commercially?
Myers: I’m probably going to do open source. I think that it’s been extremely tricky to go commercial anyway in the scientific enterprise. It’s really hard to find an edge where you have exclusivity and customers are willing to pay the requisite overhead. I’m more interested in getting stuff out there.
JR: What are your thoughts on informatics challenges facing next-generation sequencing technologies?
Myers: Most of what’s going on with the next-gen is [that] the instrument itself is producing data at such a high rate that it’s difficult to keep up. So one designs tiered systems and triages very simple algorithms that get the job done. If you have anything that’s really complex, then it’s hard to keep up with the data. It’s literally hard to keep up in real time with the data. On the instrument you’ve got to keep up, and you’ve got to do something in the instrument because you can’t just dump all that data out on a wire or overwhelm the consumers.
So I think that’s one thing. The other thing that’s kind of overwhelming because of the amount of data is to do the basic problems like assembly and analysis. I think those things are getting more and more challenging, and in a way I do think it represents a niche for a really good commercial entity to come in and really engineer and solve those problems. The problem with the academic enterprise is after a certain point there isn’t sufficient reward. So far, it still seems to be going the kind of academic open-source way. If a commercial entity comes in and offers users something that’s of sufficient value, then that’s a good way to go.
JR: How do you expect imaging will be used in biopharmaceutical research or in healthcare?
Myers: Let me try to answer your question with that caveat about my expertise and focus – I’m not a physician, I am a scientist. I think there’s a huge opportunity in diagnostics, in being able to molecularly mark a tissue sample. For example Walter Schubert, four or five years ago, was staining cells for particular molecular targets and actually taking histological views of those at 63x. I forget the two conditions [but] they were presenting gross symptoms exactly the same; you couldn’t tell any difference. But as soon as you put down this marker for basically the presence of an antibody, you could see that the antibody in one case had penetrated into the skin layer and in the other it hadn’t; so you have a very clear marker and you get it by looking in a microscope at the two samples.
It’s also the case that when you look at cancerous cells, you’ll be able to look at a cancer and mark it with certain reagents or proteins and look for markers, and we won’t do it by doing an expression assay; it will be about the distribution of that protein and its presence in certain cells that you can’t get from an expression array. It’s that high-dimensional aspect of actually seeing the distribution and the pattern in an actual histological context that will tell us what the disease will be, and it’s all going to be with this kind of stuff.
At this point, we can literally watch the uptake of a chemical into a cell and literally screen thousands of cells. So another thing pharmas might end up doing or may already be doing is to use high-throughput microscopy where in 384 wells you have samples you’re applying various chemicals to and can read out digitally whether the cell is dividing or not.
JR: Can you discuss the progress of your group’s projects?
Myers: It’s still early days. Most of them are recent projects that we’ve been working on for maybe a year or a year and a half, and the one mature one for maybe three years. The oldest one is developing a single-cell expression atlas of the worm, a kind of cellular-level project. Another is we’ve been looking at the biophysics of mitosis with Tony Hyman, again, in the first division of C. elegans, although it could pretty much be anything. That’s an example of an intracellular project we’re working on. Here at Janelia the projects include developing a complete light-level atlas of the fly’s brain with its complete developmental trajectory.
We’re also working on behavioral scans that involve observing the whiskers of a mouse while it’s being recorded electrophysiologically with probes. The other thing we’re doing, obviously, is trying to build this high-throughput microscope to capture high-dimensional, entire volumes of brains to understand stochastically the fine-grained flow of neuronal information.
JR: What kind of a microscope is that?
Myers: You know I’m not going to say because it’s one of those kinds of things where if I did, it will be pretty obvious to people who know. I want to keep my edge for a while longer.
----------------------------------
This article first appeared in Bio-IT World’s Predictive Biomedicine newsletter.