H3’s Data-Centric Approach to Cancer Genomics

June 22, 2021

TRENDS FROM THE TRENCHES: Lihua Yu began her training as a biomedical engineer during her undergraduate studies, but for her Ph.D. training, she chose to study in the lab of Dr. Temple Smith, a pioneer in computational biology at Boston University, despite being warned that this field had minimal job prospects. She didn’t know much about molecular biology then but recognized that Dr. Smith would be a strong mentor and was intrigued by his comment about needing a fresh pair of eyes to look at a problem the lab was focusing on: protein structure prediction.

“I really liked Dr. Smith as a person and as a scientist,” Yu remembered. “He always asked the most fundamental questions. He let you cut through the complexity to truly understand, at its core, what the question was and what the solution was about regardless of how complex the algorithm appeared to be.”

Dr. Smith was also incredibly attuned to his students’ development as scientists. Yu added, “He treated us differently by tailoring how he either motivated us, encouraged us, or worked with us more directly.” His leadership and mentorship provided students with what they individually needed for their development and success, and this had a great influence on Yu.

Yu first joined Astra Bioinformatics Center and later joined the newly formed US Cancer Discovery team at AstraZeneca, where she gained firsthand experience with drug discovery, translational medicine and biomarkers.

In 2011, Yu joined H3 Biomedicine (H3), a precision oncology company, in a key leadership role as head of bioinformatics, right around the time that The Cancer Genome Atlas (TCGA) open-access data were made available. Leading this new team, Yu played a pivotal role in establishing cancer genomics as a key platform of the company’s scientific engine.

“When you are actually at the table making decisions, you appreciate the importance, nuance and limitations far deeper than when you are observing them,” Yu said. “I think the first five or six years as part of leadership, actually driving the portfolio, being part of the decision-making, deepened my understanding and appreciation of discovery and translation.”

Three years ago, Yu took on the role of President of H3, overseeing R&D and corporate development operations. She recently sat down with Stan Gloss, founding partner at BioTeam, to discuss her career trajectory and how the ideas and technologies fueling precision medicine have evolved since she joined H3. Bio-IT World was invited to listen in.

Editor’s Note: Trends from the Trenches is a regular column from the BioTeam, offering a peek behind the curtain of some of their most interesting case studies and projects at the intersection of science and technology.


Stan Gloss: One of the things that I love about what you've done in your career is that you picked a focus. You got really good at cancer genomics. So, tell us about the mission of H3 and the focus that you have.

Lihua Yu: H3 was founded in 2011 by Eisai. That's the year when the TCGA started to release data. Our scientific founders, Dr. Stuart L. Schreiber and Dr. Todd Golub, are both world-renowned scientists from the Broad Institute, and Dr. Schreiber was on the Scientific Advisory Board for Eisai. They made a convincing pitch to the CEO of Eisai, Mr. Haruo Naito, that the future of cancer discovery and development is precision medicine. With TCGA at full speed and open-access data available, now was the time to focus on precision medicine. That's when Mr. Naito made a very bold decision to found a new company focusing on precision medicine in oncology. At that time, Eisai was actually facing financial challenges, as the company’s top-selling medicine, Aricept, had lost its US patent protection in November 2010.

What did you mean by precision medicine in 2011?

In 2011, our expectation of precision medicine was that the cancer genomics data from public initiatives such as TCGA and the International Cancer Genome Consortium (ICGC) would provide new cancer driver genes with somatic mutations as novel targets, in addition to a patient selection strategy at the beginning of the drug discovery program. Until then, biomarker and patient selection tended to be developed during the translational phase of the program, after the discovery phase.

How do you approach that today?

Initially, we referred to genomic data as mostly DNAseq and RNAseq data from cancer genomics initiatives. In recent years, we use the word “omics” since the molecular profiling data we access go beyond DNA and RNA. Additionally, our field has progressed to appreciate how essential genomics is to the field of oncology, from understanding the tumor initiation driver mutations to the resistance mutations across the whole journey of the patient, or even how to monitor the efficacy of a treatment using circulating tumor DNA.

To me, it is really broadly defined. How do you take advantage of such a technological advancement in the genomic era? Having the DNA of the patient is not enough; we want to know what treatments they have gone through and how well they responded, as well as how their genomics profile changed during and after each treatment. This is why we are investing and working with clinical genomics data that are either generated from our own trials or by our partners and with companies that provide real-world data from electronic health records.

The core of H3 is about leveraging data science propelled by technological advancements. We do have access to the “right” data—or the right data are being generated—in many cases to inform our drug discovery and development efforts.

In oncology, you largely have two directions: tumor cell self-autonomous targets, where you're really targeting tumor cells, and immuno-oncology targets, where you’re targeting something about the dynamic between the immune cells versus the tumor cells. H3 is expanding beyond self-autonomous and looking at the immuno-oncology targets. But consistently, in any field we're going to get into, we ask: Do we have the majority of the cancer genomic data to shed light on why we're working on this target? What are likely to be the diseases and the patients for whom we are taking those drugs into clinical development?

Once in clinical development, how do we continue not only translating the critical information to clinical development, but how do we look into our own patients’ data together with the external data to refine our strategies? What would be the likely genomic lens to look at data and identify the relationship to the mechanism of action of our own compound, helping to refine our disease and patient selection hypotheses?

While I think the common theme will still be precision medicine, the term will be used more broadly than just one mutation, one target, and one patient selection strategy. We need continued genomic information to tell us how to refine our patient selection strategy. And on top of that, we need to continually ensure the data are correctly being integrated and made readily available to the right people.

Another element that is important is the democratization of data access—making the data available to the domain experts beyond data scientists. This could include biologists, translational scientists, clinicians—we need to include the people who know their patients, target, or experiment better than a data scientist. We need to grant the right people the data and the tools for them to ask questions and to come up with innovative ideas.

While we need data scientists and analytical capabilities, we also need the tools and culture to democratize data access to the broader audience. Only this will have a real, actual impact on R&D. If only a specific group of data scientists control the data, you're not going to have a broad enough impact.

What technology choices have you made to be sure your experts are getting the best data to work with?

H3 has evolved significantly since its founding. We made an early decision to use next-generation sequencing (NGS) for RNAseq rather than array-based technology such as Affymetrix. We went in this direction within my first year at H3, even though most of us didn’t have experience with NGS. We also started to build our infrastructure in the cloud. I sent my first two hires to BioTeam’s Amazon Web Services (AWS) training in April 2012. I think we were quite early compared to most organizations. These were purely technical choices.

From a capabilities point of view, I also started working with a company called Compendia Bioscience (now part of Thermo Fisher) that was actually cleaning, integrating, and mining cancer genomics/TCGA data for target identification.

When you have a relatively small company, it’s important to focus your people on the science that they need to work on rather than on operations.

To me, being able to run next generation sequencing is operations. Very early on I made the decision to work with BGI as a key vendor. In terms of the consistency, the quality, even the cost, BGI is really better than if we do our sequencing internally. If we don't run sequencing day in, day out like CROs, our quality wouldn’t be nearly as consistent. And there's a huge amount of overhead without proper return on investment.

H3 is a drug discovery and development company. We're not a diagnostic company or sequencing company or even a real-world data company. I always choose to work with the best partners.

Beyond NGS, what other modalities are you using? CryoEM? Others? I think making smart decisions these days is based on a multimodal approach to different data combined together to give you the big picture.

Yes! From the more computational biology side of the data, we started with NGS. The genomic data clearly evolves to include functional genomic data. If you look at the huge amount of data like DepMap (Cancer Dependency Map) coming from Broad, they are using CRISPR to perturb the genome. There's a functional consequence. We also worked with Broad on the CLUE.io portal, using data from the LINCS project. In that case, the cell lines were treated with either small molecules or shRNAs and were measured for changes before and after treatment at the mRNA level.

We are also starting to work with single-cell sequencing data. The technology is getting more and more mature and there are increasingly more public datasets. We are also evolving into this area.

In the clinical space, we work with real-world data. We have collaborated with Foundation Medicine for many years, [working on] how to use clinical sequencing data to complement the TCGA-type of public initiatives. Initiatives focus on a primary disease: they have a defined number of samples, then you finish, and the initiative is done. But clinical sequencing continually evolves as long as people are being diagnosed with cancer and physicians are sending their samples into clinical sequencing labs to inform their decisions.

And the huge advantages of this are twofold. One, of course, is that the dataset will continue to grow. So right now, if you compare Foundation, or even Guardant Health, versus TCGA, it's an order of magnitude bigger dataset.

The other advantage is that because there are new cancer patients consistently coming into the clinics, not only do you have a snapshot of the most mutated gene or what is the likely driver mutation of tumor initiation, you can also see the evolution of the cancer genomics. They are not static. If we sequenced all of the cancer patients 50 years ago compared to today, they would be different since naturally patients have been treated differently over the decades. There are different resistance mutations emerging or being selected. I think the clinical sequencing data also gives us a huge advantage in understanding what other resistance mechanisms might be emerging or why a certain patient may or may not respond to the current standard of care.

You mentioned single-cell technologies. What do you find is an advantage, and what piece of the puzzle does that give you that you don't get from just genomics?

While we are in early stages, we are testing this technology for our immuno-oncology projects. Obviously, when you have bulk DNA sequencing or bulk RNA sequencing, you are guessing that the target or the mutations you're interested in are present in certain cell types. We have all kinds of computational ways to deconvolute or estimate what kind of different immune cells might be in this mixture of tumor and immune cells, but you never know for sure what immune cells you truly have. For a particular gene you are interested in for an IO hypothesis, you have to ask, “is this gene present in tumor cells, or in particular types of immune cells?”

I think the advantage [of single-cell sequencing] is for us to obtain a much more precise understanding of the immune target we're working with. In clinical trials, you monitor the patient response, and you do sequencing before treatment. But now, you can monitor whether your T-cell repertoire is changing when you deliver an immune therapy or whether the immune infiltrating population is changing in the direction your drug is designed to influence.

I think we all agree that cancer isn't about tumor cells only. It’s a dynamic of cancer versus the microenvironment, including immune cells, and other cell types in a tumor microenvironment. And before we had such a good tool, it was all guessing. You had suggestive scientific evidence to say you're probably right, but you wouldn’t know for sure.

Now I think [single-cell sequencing] allows you to understand where your target is and how your targeting strategy is affecting the tumor microenvironment and the tumor-versus-immune cell dynamic you were in search of. That’s the high-level goal we are trying to achieve.

We haven't done a huge amount of work at H3 yet to see if this works. I can’t give you an example that altered the direction of our project—we haven't reached this point yet—but it’s exciting to think about the potential of such technology.
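As an illustration of the computational deconvolution of bulk data that Yu mentions above, the following is a minimal sketch, not H3's method. It assumes a small "signature" table of average marker-gene expression per cell type and estimates non-negative cell-type fractions for a bulk sample by projected gradient descent on a least-squares fit. The function name `estimate_fractions` and all expression values are invented for illustration, and published deconvolution tools are considerably more sophisticated.

```python
def estimate_fractions(signature, bulk, steps=2000):
    """Estimate non-negative cell-type fractions f so that signature * f ~ bulk.

    signature: one row per gene, one column per cell type (illustrative values).
    bulk: one bulk expression value per gene.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # Conservative step size: 1 / (sum of squared entries) bounds the
    # curvature of the least-squares objective, so the descent is stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(steps):
        # Residual between the predicted and the observed bulk expression.
        residual = [
            sum(s * f for s, f in zip(row, fractions)) - b
            for row, b in zip(signature, bulk)
        ]
        # Gradient of 0.5 * ||signature * f - bulk||^2 with respect to f.
        gradient = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step, then clip at zero: fractions cannot be negative.
        fractions = [max(0.0, f - step * d) for f, d in zip(fractions, gradient)]
    total = sum(fractions)
    return [f / total for f in fractions] if total > 0 else fractions


# Invented signature values: rows are marker genes, columns are cell types.
cell_types = ["T cell", "B cell", "tumor"]
signature = [
    [9.0, 0.5, 0.1],   # CD3E
    [0.2, 8.0, 0.1],   # CD19
    [0.1, 0.1, 10.0],  # EPCAM
    [6.0, 0.2, 0.1],   # CD8A
]

# Simulate a bulk sample from a known mixture, then try to recover it.
true_mix = [0.2, 0.1, 0.7]
bulk = [sum(s * f for s, f in zip(row, true_mix)) for row in signature]

estimated = estimate_fractions(signature, bulk)
for name, fraction in zip(cell_types, estimated):
    print(f"{name}: {fraction:.2f}")
```

Even when such an estimate fits the bulk profile well, it remains an inference about the mixture. Single-cell sequencing instead measures which cells express the gene of interest, which is the distinction Yu draws.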

You're now dealing with an explosion of data. What do you do at H3 to not only collect and store the data, but also make sure that the data is capable of being shared? Are you thinking about making your data FAIR compliant?

As much as we can, we try to move the team in this direction. Typically, researchers focus on the technology, the task at hand. People don't think about data management, data sharing, reproducibility and reusability. So, someone generates the data, one data scientist analyzes the data, and they tell the team, "This is what we found."

What I'm trying to work on with my data science team, as well as other key stakeholders, is to talk not only about the data analytical and storage capabilities we need, but also the necessary data management capabilities in the very beginning, even when we haven't generated a single piece of data yet.

All three of these elements are happening at the very beginning. Do we do well in all of them at this time? Absolutely not. But in terms of how I drive the organization, we never say, “Oh, we need single-cell sequencing. Let's please make sure we have a machine. Let's please make sure we generate the data. Once the data scientists have sent the data back to the project team, this is good enough.” Instead we’ll discuss, “Oh, what are your storage needs? How are you going to manage this data if somebody asks to reuse the data again? How do we plan to aggregate all of the single-cell sequencing data we generate to share with others in the organization?”

Unfortunately, most people do [the planning] afterward, after the data become unruly. Most people don't even think they need any data management tool until five years later. Then they say, "Sorry, let me bring in a consultant to help me clean up the data, re-annotate the data, etc." At this point, you have already lost quite a bit of opportunity, and some information is missing because you didn't think about data management from the start of the process.
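One way to picture planning data management before any data exist is a metadata record that is created when an experiment is proposed and checked for gaps along the way. The sketch below is hypothetical: the `DatasetRecord` fields, the `missing_fields` helper, and the example identifiers are assumptions made for illustration, not a description of H3's systems or of any FAIR tooling.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata captured before any data are generated."""
    dataset_id: str
    assay: str
    owner: str
    storage_uri: str = ""          # where the data will live
    reuse_terms: str = ""          # who may reuse the data, and how
    sample_annotations: dict = field(default_factory=dict)


def missing_fields(record):
    """Return the names of fields that are still empty."""
    return [
        name
        for name, value in asdict(record).items()
        if value in ("", None, {})
    ]


# A planned experiment: the record exists before the first sequencing run.
planned = DatasetRecord(
    dataset_id="sc-rnaseq-0001",   # hypothetical identifier
    assay="single-cell RNA-seq",
    owner="project-team-a",        # hypothetical team name
)

# Gaps surface at planning time, not five years later.
print(missing_fields(planned))

# The record can be stored next to the data as a JSON sidecar file.
print(json.dumps(asdict(planned), indent=2))
```

Keeping the record in a plain, serializable form is a deliberate choice in this sketch: it can travel with the data and be aggregated across experiments without a dedicated tool.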

We see people solely focused on the primary use of the data that they generate without ever thinking about how that data is going to be used again. I think probably 80 or 90% of data that's put into a file system is never looked at again!

We should set up a data management system so that the data is touched more: the more it's touched, the more valuable it gets. Somebody else looks at it and connects it to another piece, and then it just keeps connecting and growing and growing in value to the organization.

I think data scientists, like us, bear a huge responsibility for why our key stakeholders don't see it this way. We tell them the word FAIR, but what does it really mean? We say, “In the future, the data will be valuable,” but for most of them, their priority is today. Our data scientists aren’t thinking this way enough.

In the old days, you didn’t need as much of the kind of data management we're talking about today because the biological data was low throughput. Now, we're talking about orders-of-magnitude differences in throughput. This fact alone mandates proper data management.

The most important thing is the data’s comparability. If you look at images of western blots compared to NGS, or even compared to Affymetrix, why should you bother to aggregate the data if it's not comparable?

You can only look at one western blot. With the variables—how long they exposed them and all of that—there is nothing you can aggregate. But for the new generation of data we are talking about today, especially digital native data, the data are much more objective, aggregable and reusable. You get much more power by data aggregation and data reuse.

And we have the ability to actually do machine learning on it and have valid training sets. The power of training with high-quality data is to really hone your analysis to be able to sense the right signal.

Absolutely. I think that's why having the right way of looking at the machine learning promise is important. Be open-minded! Machine learning might mean you are learning something you don't know yet, so don't be too skeptical.

On the other hand, simply having a lot of electronic files doesn’t mean you can do machine learning. You need to recognize whether you have the right kind of data to inform the process. The shift is clearly happening, even though we're still working through all of the cultural, organizational, and technology challenges. I think now if you visit a small biotech company in Cambridge, most of them will have an informatics or computational biology group, while also pushing for a chief data science officer role to lead the organization.

In general, we have intuitively arrived at the understanding that data are an integral part of R&D. In the past, nobody would say, “I'm starting a drug discovery company, but by the way, no biology or chemistry department.” But data science has always been a question. “Do you really need data scientists? Maybe you can simply hire one bioinformatician as a consultant since we have only two NGS datasets.”

Today, people are increasingly aware that you actually need to build this kind of data capability as part of the overall R&D infrastructure of a company. I think with this kind of a cultural shift, we'll be less in the position of looking back to say, "We have a lot of data, it’s very messy, please come rescue me."