Yaron Turpaz on Human Longevity's Big Plans

July 27, 2015 | Human Longevity launched in March of last year, the latest of J. Craig Venter’s genomics startups, headquartered in San Diego. The vision was grand: sequence 100,000 human genomes a year and “change the way medicine is practiced by helping to shift to a more preventive, genomic-based medicine model,” Venter said at the company’s launch. “Our goal is not necessarily lengthening life, but extending a healthier, high performing and more productive life span.”

The company started with an $80 million initial investment that Venter expected to last for about 18 months. On the eve of that milestone, Bio-IT World editor Allison Proffitt spoke with Yaron Turpaz, HLI’s Chief Information Officer, about the first year’s progress and the kind of knowledgebase HLI is building.  

Turpaz joined HLI at the outset with experience leading R&D IT at AstraZeneca and Eli Lilly. He was tasked with building and growing the company’s genomic and phenotypic database business; leading all bioinformatics/informatics and software engineering efforts; expanding the informatics program at HLI’s California facility; and building HLI’s computing and informatics program and facility in Singapore.

It’s a daunting list, and one Turpaz wouldn’t mind some help with. The company lists 56 open positions for researchers, bioinformaticians, software engineers, and clinicians in San Diego, Singapore, and Mountain View, Calif. The goal for all of them is to build a million-genome database replete with phenotypic and clinical data. Then the fun will really get started.

Bio-IT World: Your mandate at HLI is vast. Tell me your take on the company’s vision in light of those responsibilities.  

Yaron Turpaz: The umbrella goal of what we are trying to do is simply to transform healthcare. We would like to move from reactive medicine to preventive medicine, so basically not just go to the doctor when you get sick, but really [use] the different genomics and other ‘omics elements to define the baseline, assess risks, and track changes to identify peaks early enough so we can translate [data] to care from a preventive approach.

In addition, a lot of what we’ll do will be used for novel target discovery, biomarker discovery, and patient stratification, including the rescue of failed clinical trials. The core of what we’ll do is to build this knowledgebase with hundreds of thousands of genomes that are then linked to whole-genome microbiome [sequencing] and metabolomics on a large scale and, when possible, full-body MRI imaging and other assays. We will try to integrate this information and do a lot of downstream analysis — kind of classic bioinformatics, just with the right statistical power and with multi-dimensional data. In addition, we are using machine learning to explore the data without prior knowledge and try to understand what comes out.
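A minimal sketch of that kind of hypothesis-free exploration, using off-the-shelf unsupervised learning; the feature matrix, dimensions, and parameters here are invented stand-ins, not HLI's actual pipeline:

```python
# Illustrative sketch of hypothesis-free, multi-dimensional exploration:
# cluster samples on combined genomic/phenotypic features without prior
# labels, then inspect what the groupings correspond to. Shapes and
# parameters are invented, not HLI's pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))  # stand-in for samples x features (variants, metabolites, ...)

# Reduce dimensionality first so clustering is tractable and stable.
X_reduced = PCA(n_components=20).fit_transform(X)

# Unsupervised clustering: no phenotype labels go in; cluster membership
# is compared against phenotypes only afterwards.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(clusters))  # cluster sizes
```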

What does the infrastructure look like to capture that? 

We established the largest sequencing facility in the world, and we now have a capacity of almost 40,000 whole genomes sequenced at 30x coverage per year. We have 24 of the HiSeq X machines from Illumina, and it was a major effort initially just to put the lab in production and build the required IT and informatics systems to handle the vast amount of data at an unprecedented scale and speed. We are for the most part cloud-based, so we try to maximize the use of the cloud where possible.
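For context, that figure is consistent with Illumina's advertised HiSeq X Ten throughput of roughly 18,000 genomes per year across ten instruments; a quick back-of-the-envelope check (the per-machine number is Illumina's marketing figure, not an HLI disclosure):

```python
# Back-of-the-envelope check of the stated capacity, using Illumina's
# advertised HiSeq X Ten throughput (~18,000 thirty-fold genomes per
# year across ten instruments). Real yield depends on uptime and run
# success rates, hence "almost 40,000."
genomes_per_machine_per_year = 18_000 / 10  # Illumina marketing figure
machines = 24
print(machines * genomes_per_machine_per_year)  # 43200.0 at full utilization
```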

Basically, as the data comes off the sequencing machines, we have hot storage that takes the data right before we upload it to the cloud. We are using the EMC Isilon platform as hot storage and then transfer the packaged data to the Amazon cloud for all of our production and downstream data analysis.
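A minimal sketch of that hand-off, assuming AWS S3 access via boto3; the mount point, bucket name, and run layout are hypothetical, not HLI's actual configuration:

```python
# Minimal sketch of the hand-off described above: completed run data
# lands on on-premise hot storage (e.g. an Isilon NFS mount), is
# packaged, and is pushed to S3 for downstream analysis. Paths, bucket,
# and naming are hypothetical.
import tarfile
from pathlib import Path

import boto3

HOT_STORAGE = Path("/mnt/isilon/runs")   # hypothetical Isilon mount
BUCKET = "hli-production-genomes"        # hypothetical S3 bucket

s3 = boto3.client("s3")

def ship_run(run_id: str) -> None:
    """Package one sequencing run and upload it to the cloud."""
    run_dir = HOT_STORAGE / run_id
    archive = HOT_STORAGE / f"{run_id}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(run_dir, arcname=run_id)
    # boto3 handles multipart upload automatically for large objects.
    s3.upload_file(str(archive), BUCKET, f"raw-runs/{run_id}.tar.gz")

ship_run("HISEQX_RUN_0001")
```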

The projection is that we’ll expand our sequencing capacity based on demand as we sign more significant contracts. We landed an agreement with Genentech and we’re in discussions with other pharma companies. As we land more contracts with requirements for more [sequencing] machines, we’ll increase the throughput. We’re now building the commercial HLI Knowledgebase that different companies can subscribe to in order to extract knowledge and mine for transformational discoveries.

The things that you’ve mentioned—whole-genome sequencing, microbiome sequencing, full-body MRIs and imaging—where do these data come from? The agreement you mentioned with Genentech, for example, is only for de-identified genomes.

Correct. In general, when we establish collaborations around large cohorts, we retain the rights to the data, as well as the de-identified phenotypic and clinical data. We are not a fee-for-service sequencing facility, so it’s not about sending us your sample just to get your data back. We will only sequence when there is high-quality phenotypic and clinical data that comes with the samples, and they will be part of the HLI Knowledgebase. This is the main source of information for the hundreds of thousands of samples based on existing cohorts. Plus, we are now working with pharma to change the consents of clinical trials, so moving forward, on any new clinical trial this would be part of the collection. You can imagine that in the future every clinical trial will require full genome [sequencing] for each patient.

Craig Venter has said that the company is going to begin to collect 3D scans of people’s faces as well. How does that fit?

We recruited 1,000 individuals to start with, and we conducted the full face analysis and linked it to their genomes and many other parameters. This is just the beginning. This year we are going to open our first HLI Health Nucleus in San Diego as an entry point for individuals to come in and go through full whole-genome analysis, a full-body MRI, and collection of other detailed phenotypic data. At the first stage, our Health Nucleus will be established as a research-use-only proof of concept of what can be done, and all information will always be communicated to individuals by physicians.

The volume of data will come from cohorts and collaborations with pharmaceutical companies, hospitals, and research institutes. At the moment, the healthy individuals or patients that will come [to HLI’s Health Nucleus] will not be the major source of samples compared to the hundreds of thousands of samples from clinical trials and [collaboration] cohorts.

Are you following these individuals longitudinally? 

Yes, that is one of the advantages of the Health Nucleus. For example, we can collect stool samples at multiple time points for microbiome analysis. We are also considering collecting saliva or skin samples for the different sources of microbiome populations. We developed whole-metagenome analysis of the microbiome. Most microbiome offerings out there from different companies focus only on the 16S rRNA gene for taxonomy, which we know is not sufficient—not for detailed species analysis, and also not for the downstream integrated pathway analysis that we do. We believe it’s important to follow up and continue to learn about the microbiome population in integration with the whole human genome, and then have a better understanding of how it relates to triggering different diseases.
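A toy illustration of the resolution difference Turpaz describes, with invented read classifications standing in for real shotgun-classifier output:

```python
# Toy illustration: shotgun (whole-metagenome) reads can often be
# classified to species, while 16S amplicons frequently resolve only to
# genus. The classifications below are invented; a real pipeline would
# take them from a metagenomic classifier.
from collections import Counter

shotgun_calls = [
    ("Bacteroides", "fragilis"),
    ("Bacteroides", "vulgatus"),
    ("Faecalibacterium", "prausnitzii"),
    ("Bacteroides", "fragilis"),
]
amplicon_calls = ["Bacteroides", "Bacteroides", "Faecalibacterium", "Bacteroides"]

species_abundance = Counter(f"{g} {s}" for g, s in shotgun_calls)
genus_abundance = Counter(amplicon_calls)

print(species_abundance)  # species-level detail, usable for pathway analysis
print(genus_abundance)    # genus-level only
```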

Some of the existing cohorts—for example, the collaboration with Prof. Tim Spector from King’s College London, the largest twin cohort in the world [TwinsUK]—have consented to follow-up. Many of the cohorts that we collaborate with actually have this element. The process of following up with cohort participants to collect more samples is a bit slow, but it’s possible to do. And then, of course, any future studies that we initiate with these institutions will be designed longitudinally.

Is there a biobank? Are you keeping any tissue samples, or is everything digital?

HLI’s focus is on the digital elements of individuals, and we may partner with different sources of samples, including biobanks that are interested in genomic characterization of their samples. The goal is to generate the data and do all the required analyses. We follow all the compliance requirements and restrictions of each study and either destroy samples or send them back.

So data are coming in from pharma collaborations as well as the private San Diego Health Nucleus, and then your job is to store, analyze, and figure out what to do with all of that data? 

Yes, pretty much. Part of [our goal] is to generate high-quality data and then, as you say, store it, as well as develop solutions that allow you to do the downstream analysis. The analysis can be processed in-house in batch mode via our APIs, or by pharma and physicians through a user interface when subscribing to the HLI Knowledgebase. We have diverse sources of data and are working closely with global pharmaceutical and biotech companies, hospitals, research institutes, academia, government agencies, biobanks, and insurance companies.
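HLI's API has not been published, so the following sketch of batch, subscriber-style access is purely hypothetical; the endpoint, parameters, and response fields are all invented:

```python
# Hypothetical sketch of batch access to a knowledgebase of this kind
# via a REST API. The URL, credential, parameters, and response fields
# below are invented for illustration only.
import requests

API_BASE = "https://api.example-knowledgebase.com/v1"  # invented URL
TOKEN = "subscriber-token"                             # invented credential

resp = requests.get(
    f"{API_BASE}/variants",
    params={"gene": "APOE", "min_allele_count": 10},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json()["variants"]:
    print(record["id"], record["phenotype_associations"])
```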

As you build these solutions, what do you see as the biggest challenges?  

In the first year, the main priority and goal was to establish the production lab and the production analytics, which we continue to optimize. What we focused on this year is the downstream. So now that we have high-quality genomes and high-quality data stored on the cloud with a highly optimized analysis workflow, it’s about the downstream multi-dimensional analysis solutions and biological interpretation. And as you can imagine, establishing the Health Nucleus also requires a major informatics effort.

What are your target numbers for the knowledgebase?  

Our goal is to have at least a million genomes in the knowledgebase by 2020, and to reach those numbers in this timeframe we obviously need to scale up in capacity and technology. The assumption is that capacity needs to be increased, and also that sequencing technology will change and get better in speed and quality.

We’re technology agnostic, so right now we have the Illumina machines and we also have two PacBio machines. Our goal is to work with the best technology out there for the best solutions, quality, capacity, and cost, and we will change and adapt as these technologies are developed. Let’s assume that this will advance and progress in the next few years; we will continue to tap into that to reach at least one million genomes.

You said one million genomes, but you also said you don’t want data that doesn’t have phenotypic and clinical data with it. So your goal is one million genomes with their accompanying phenotypic and clinical data.  

Exactly. We believe it’s essential for the downstream analysis and for a better understanding of the different elements as we work toward the transformation of healthcare.

As you’re gathering this data and you’re storing it on the cloud and running your analyses on it, how do you pick what to start working on clinically? Are you going to just mine the data and see what jumps out? Are you going to make decisions with your partners on what they want you to look at? 

I think in reality it’s a combination. Initial analysis is driven by the interests of our partners. The different questions asked of the different cohorts depend on the partnership and the type of data. Our goal, of course, is to enable our partners to gain novel insights and focus on those studies. A key value that HLI brings to the table is the aggregated analysis across all genomes in our Knowledgebase, so a natural next step in the analysis is to integrate the cohort data with other relevant cohorts and generate or test new hypotheses.

In addition, there are specific studies, such as the [3D facial scans], which address specific questions about how the genome relates to physical traits in the face. Our approach with machine learning is to let the data speak. We’ll identify associations and links that relate to interpretations of phenotypic data that we didn’t even think of asking about.

It is a bit too early for that. We need a critical mass of data to actually start to draw such conclusions with enough statistical power; this approach will add value as we grow and build the knowledgebase.