Data Management And Data Transfer Technologies Giving A Spark To Collaboration
March 21, 2019 | Bruce Kozuma understands that the most important thing a scientific research institution can do is collaborate.
As a Principal Systems Analyst for the Broad Institute of MIT and Harvard, Kozuma spends most of his time helping Broad labs use information technology to improve and speed their research. One way of achieving that is through the use of cloud technologies.
Using cloud-based data management and data transfer systems can boost the overall productivity of a study, and the results speak for themselves: reduced costs and an increased ability for multiple labs to work together on solving problems.
On behalf of Bio-IT World, Hannah Loss spoke with Kozuma about the impact data transfer and data management systems have on scientific research, the adoption of support technologies at the Broad Institute, and the challenges that remain for data-intensive research.
Editor’s note: Hannah Loss, a conference producer at Cambridge Healthtech Institute, is helping plan tracks dedicated to Cloud Computing and Data Transfer at the upcoming Bio-IT World Conference & Expo in Boston, April 16-18. Kozuma is chairing and will be speaking on the program. Their conversation has been edited for length and clarity.
Bio-IT World: Regarding data transfer and data management across multiple institutions, what has been the positive and negative impact on scientific research?
Bruce Kozuma: It's important for us at the Broad Institute, in particular, to share our data. Much of the work that we do is publicly funded through NIH and NSF grants, so much of our mission is to openly share our findings and our methods. Inherently, we are collaborators. Our full name is the Broad Institute of MIT and Harvard, and Harvard-affiliated teaching hospitals, which includes Beth Israel Deaconess Medical Center, Massachusetts General Hospital, Brigham and Women’s Hospital, Boston Children’s Hospital, and Dana-Farber Cancer Institute.
We also have collaborators around the world. Some of the collaborations that I've had a small part in supporting are for psychiatric studies using biological samples from collaborators in Sub-Saharan Africa. That's only one example. A tremendous amount of sampling has been done for disease-specific studies where sharing the data is very important to further scientific research around the world. However, there are technical challenges in the sharing of data and administrative challenges as well.
An example to illustrate some of the administrative challenges: I generate the data. I share the data with you. I don't necessarily share that off of a file system sitting on my premises. I may share that up to a Google Cloud bucket, and then give you access to that data in the cloud.
Given the size of data these days, it's not necessarily efficient to pull that cloud data down to your own systems and then do your analysis and compute there. It may be more effective for you to do your analysis in the cloud. Well, what happens if I decide not to pay for the data you’re using anymore? Does your access go away? How do I know that I'm allowed to share that data with you in the first place, since the data is often publicly funded but comes with specific restrictions? How do I know that the data is consented for the particular use you've got in mind? There's a whole host of additional questions that come up these days. They all need to be solved so that we can do ethical, responsible research that affects patient outcomes.
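As a minimal sketch of the sharing pattern Kozuma describes (granting a collaborator read access to a cloud bucket, then revoking it when the collaboration or the funding ends), here is what that might look like with Google Cloud Storage's Python client; the bucket name and collaborator identity are hypothetical:

    from google.cloud import storage

    BUCKET = "shared-study-data"                   # hypothetical bucket name
    COLLABORATOR = "user:researcher@example.org"   # hypothetical collaborator

    client = storage.Client()
    bucket = client.bucket(BUCKET)

    # Grant the collaborator read-only access to everything in the bucket.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {"role": "roles/storage.objectViewer", "members": {COLLABORATOR}}
    )
    bucket.set_iam_policy(policy)

    # Later, if the data owner stops paying for the collaboration, access
    # goes away by removing the member and re-applying the policy.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    for binding in policy.bindings:
        if binding["role"] == "roles/storage.objectViewer":
            binding["members"].discard(COLLABORATOR)
    bucket.set_iam_policy(policy)

Access control answers only the first of Kozuma's questions; whether the data is consented for a given use still has to be tracked administratively.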
The sheer scale of data presents new challenges. I think the Broad has somewhere around 80 petabytes of data. Each individual piece of data can also be hundreds of gigabytes in size. Just reading the file to find the part you're interested in is still challenging. As we move to the cloud, we're able to think differently about this large body of data: maybe we can just take the section of the file that we're interested in rather than the whole file. So, some of our data processing approaches that have been with us since the '70s, we're rethinking now as we move to the cloud.
Can you provide any more detailed examples about what technology supports research at the Broad Institute, sample tracking or otherwise?
We at the Broad use “technologies” as a very expansive term. We might mean single cell sequencing. We might mean microfluidics. We might mean cloud compute. We might mean machine learning. It might be a new assay. It might be a new set of reagents. It could be something as radical as CRISPR. All of those things are used at the Broad, and it’s part of our job, like it is with any research institute, to develop and apply new technologies on a daily basis. We have specialized software for doing data transfer because we have so much data that we need to share. Also, if you think of a computational tool for data processing or analysis, we probably are using it. We have a fair number of people, for example, who are using Scala at the Broad. We've got people investigating the Julia language at the Broad, and all sorts of other things in between.
We develop our own technologies that we then share with the world, like GATK, Hail, and FireCloud (which has now been rebranded as Terra), and that we use internally. I have spent my time looking more at the LIMS, ELN, SDMS, LIS, LES sort of space, and we've actually used Atlassian's Jira coupled with G Suite as sample tracking software, mostly for visualization and high-level tracking. Not necessarily for some of the more LIMS-specific stuff like freezer management, aliquots, and that sort of thing, but just in terms of, "Where's my pizza?" Other people use Trello or Asana.
Let me give you an example. At a very high level, the life of a sample at the Broad begins well before the sample even exists. People decide they want to do a study. They pick a certain cohort that they want to study. They get their studies funded, and make sure that the study is all cleared by Institutional Review Boards. They go out and collect the samples and, then, we have a whole bunch of activities that kick in. How, precisely, do we get the sample in house? How do we know that that sample is coming? How do we prepare that sample for analysis? How do we track it once it's in analysis? How do we track it through its data analysis steps? Then, how do we know that we've delivered the data to the people who need to do further analysis on it?
If we break those steps into before the sample is sent, tracking the sample in transit, processing the sample before sequencing or proteomics, and then tracking it through data analysis, you can think of that as five columns: Before Shipping, In Transit, Preparation, Processing, and Analysis. That's just a simple five-column board for us, where we will know, "OK, we're going to get six samples from this region," so that people know that they've got enough materials to do their work and can plan their workload. "Now we've got seven samples on-site, two more are in transit. We've got three of those in process. Four of them are already into data analysis and, then, we've already just shared that data out with two others." So, it's a way for project managers and Principal Investigators to understand where things are, so they can make sure that they're meeting their research goals, that they're reporting correctly back to the funders on how the funds are being used, all of those sorts of things.
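For illustration, a bare-bones version of that five-column "Where's my pizza?" view can be sketched in a few lines of Python; the column names come from the board Kozuma describes, while the sample IDs and their stages are made up:

    from enum import Enum
    from collections import Counter

    # The five columns of the board described above.
    class Stage(Enum):
        BEFORE_SHIPPING = "Before Shipping"
        IN_TRANSIT = "In Transit"
        PREPARATION = "Preparation"
        PROCESSING = "Processing"
        ANALYSIS = "Analysis"

    # Hypothetical sample records; in practice these might live in Jira,
    # a G Suite sheet, Trello, or Asana, as mentioned earlier.
    samples = {
        "SAMPLE-001": Stage.ANALYSIS,
        "SAMPLE-002": Stage.PROCESSING,
        "SAMPLE-003": Stage.IN_TRANSIT,
        "SAMPLE-004": Stage.BEFORE_SHIPPING,
        "SAMPLE-005": Stage.PREPARATION,
    }

    # The project manager's view: how many samples sit in each column.
    counts = Counter(samples.values())
    for column in Stage:
        print(f"{column.value}: {counts[column]} sample(s)")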
As you mentioned before, funders are very important and that's caused you guys to operate through collaboration. It seems like you share most, if not all, of your technology. How can other labs be encouraged to adopt, say, the sample tracking technologies? Is it through this open-source sharing or are there other ways?
One method is through open-source sharing, as the Broad has done with Terra and Hail. Some of it's just straight-up knowledge-sharing. I started at the Broad as a consultant. My job was to come in and look at what the Broad was doing around LIMS, in particular, and see if there's something that we do that would be applicable across the institution. As I investigated certain areas, I looked and said, "Oh, you've adopted lean manufacturing techniques such as Kanban for the processing of genomes." So, a lot of it is simply applying things that have been found in other industries. There was a very good paper by Sheila Dodge on applying manufacturing techniques to the manufacture of genomic data, for example. So one way is through traditional papers.
Another way is through collaboration. There are several collaborations that we're working on, with Harvard for example, and with other labs, where we're hoping that if we can come up with standard ways of tracking cell bank information, for example, then other groups involved in the collaboration will adopt the same techniques. That will harmonize the data itself, and it'll save those other labs the work of figuring out some of this stuff. Our conversation might go something like this: "Here's how we do it. You can adapt it to your lab, but if we keep it roughly the same at this point of the interface and at that point of the interface, you can do whatever you need to in your lab. But, then, we know we'll have an easier time of comparing apples to apples when it comes to, 'How are you processing your samples? Why are we getting different results? Oh, wait. We're doing this extra step that we found out about; you didn't know about it the last time we talked. OK, maybe that'll improve the quality of the data, or improve the number of samples that have the right kind of results that we're hoping to find.'"
That's what I love so much about these cloud projects. It shows the importance of communication, as you said. Now, I know you're specifically focusing on a low-cost sample tracking system. Is cost one of the biggest challenges, or what challenges do you think remain for data-intensive research moving forward?
Cost is definitely one of those dimensions that needs to be discussed. It's not the only one, but it's one of the major ones because of the scale of data. Again, the Broad has somewhere around 80 petabytes, I think, of data under management, and a very large percentage of that is in the cloud. Well, you can look up what the costs are according to Google and then figure out what that costs per day; that's a lot of money spent on data sitting there.
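To make "a lot of money" concrete, here is a back-of-the-envelope calculation; the $0.02/GB/month figure is an illustrative Standard-class rate, since actual Google Cloud pricing varies by region, storage class, and negotiated discounts:

    # Rough storage cost for 80 PB at an assumed $0.02/GB/month rate.
    petabytes = 80
    gigabytes = petabytes * 1_000_000
    monthly = gigabytes * 0.02
    print(f"~${monthly:,.0f} per month, or ~${monthly / 30:,.0f} per day")
    # => ~$1,600,000 per month, or ~$53,333 per day

Even if only a fraction of the 80 petabytes sits in the most expensive tier, the daily bill motivates the tiering Kozuma describes next.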
Some of the things that we do is we say, "Okay, if you don't really need it sitting out on the kitchen counter, can you put it in a closet? If you don't really need it in the closet, can you put it in the basement?" Each of those places has a different cost. So, if your study is ongoing, and you're still collecting data, and you're still doing analysis, and you're still writing the paper, then, yeah, you may need that data out right on the kitchen table for everybody to see. Once the paper's in, and you've already gotten the comments back from the committee, maybe you can take that data and move it into the closet, where it's still available but not costing you quite so much. Then, once you've moved on to the next study, and the paper's out and the inquiries have stopped coming in, keep the data, because you need it for reproducibility, but really make it low-cost and move it to the basement.
That is a very manual process right now. Partially that's because everybody's needs differ, but the actual process of moving that data is also manual. One of the things that I know we've got a lot of people working on at the Broad is how do we say, "Okay, once you give me the signal, move this stuff," or, better, I send you a signal of, "Hey, you really haven't accessed this data in x number of months. Maybe you want to save yourself some money and move it over here." Then that just happens, and the researchers don't have to think about it. So thinking about cost reduction is an important thing.
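The kitchen-counter/closet/basement tiers map naturally onto cloud storage classes, and part of the automation Kozuma describes can be expressed as bucket lifecycle rules. A minimal sketch with Google Cloud Storage's Python client follows; the bucket name and age thresholds are assumptions, and note that these rules fire on object age rather than last access, so the "haven't accessed this data in x months" signal would need additional tooling (access logs, or a feature like GCS Autoclass):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("shared-study-data")  # hypothetical bucket

    # "Closet": after 90 days, demote Standard objects to cheaper Nearline.
    bucket.add_lifecycle_set_storage_class_rule(
        "NEARLINE", age=90, matches_storage_class=["STANDARD"]
    )
    # "Basement": after a year, demote Nearline objects to Coldline,
    # kept for reproducibility but at a much lower storage price.
    bucket.add_lifecycle_set_storage_class_rule(
        "COLDLINE", age=365, matches_storage_class=["NEARLINE"]
    )
    bucket.patch()  # apply the updated lifecycle configuration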
I also mentioned the size of data. Since the data is so big, maybe you don't want to incur the cost of reading in a hundred-megabyte file when you really only need the two megabytes in the middle. Cloud, much like computers have always done, can take a process that you know how to do and automate it. Then you look at it and say, "Well, why are we doing it like that? Rather than reproducing in the cloud the way we do things on premises, why don't we rethink the process? We could do things the old way, but there are a bunch of new ways of doing things possible on the cloud. Let’s look at the intent and figure out the best way to get that thing done using the cloud." You can achieve the same result done completely differently.
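That "two megabytes in the middle" pattern is directly supported by cloud object stores through ranged reads. A small sketch, again with the Google Cloud Storage Python client and a hypothetical object and offsets:

    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("shared-study-data").blob("genomes/sample-001.bam")

    # Fetch only ~2 MB from the middle of a large file instead of
    # downloading the whole object (start/end are inclusive byte offsets).
    start = 50 * 1024 * 1024
    chunk = blob.download_as_bytes(start=start, end=start + 2 * 1024 * 1024 - 1)
    print(f"Fetched {len(chunk):,} bytes without reading the rest of the file")

In practice, an index (such as the .bai files that accompany BAM data) is what tells you which byte range holds the region you care about.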
That process rethinking requires rethinking tool sets and mental approaches to problems, which takes time, and it takes this collaboration between the molecular and computational biologists, between the clinical people, between the IT types and the data scientists, to say, "Well, maybe it is that two megabytes in the middle you need, but you need this other megabyte over here because that's where the quality data is, and you need all of those. So, let's devise some very efficient algorithms to get that kind of information out and, therefore, save you research funding." Now you've got something that's a very different kind of thing that requires an interdisciplinary approach.
That points to another challenge beyond cost: you need so many different skillsets together to tackle these problems. Finding those under one roof is one of the reasons that the Broad exists: to put all those skillsets together. Even then, we still are reaching out constantly, trying to find who's got the best stuff inside and outside the Broad, and how we take advantage of that to move things forward, because we're not doing this out of academic interest. We're doing this to improve human health.
I think, personally, one of the great strengths that the Broad has is that so many of our researchers are MD/PhDs, or have labs directly in teaching hospitals, so they often have patients directly in mind. They can picture them in their mind's eye as they do research. They know they're going to be with them in the next number of weeks. Research is an urgent, personal thing. Psychological studies have been done that show that if you have a coupon that is exchangeable for another coupon that's worth $2, you don't value that second-level coupon as much as you'd value the $2 coupon itself. If you can remove those indirection pointers, in computer science speak, then it means more to you, and you've got greater urgency around any specific research project. That's what's great about the Broad: we know we've got patients in beds, we can practically picture them. With such large teams, that personalization, that social thing, becomes that much harder.
So, yeah, there’s a whole new set of issues the cloud presents. How to deal with large-scale data is an issue. Cost is an issue. The social component is an issue. Rethinking the tooling for the cloud is an issue. Collaborating at scale is an issue.