Genetic Analysis at Biobank Scale: How Regeneron Scaled Informatics with Apache Spark™

(May 15, 2019)

Webinar Description:

The field of genomics has matured to a stage where DNA sequencing projects have reached population scale. And while many organizations have invested in large genomic datasets like the UK Biobank, few have the expertise or proper technology architecture to turn these massive volumes of raw DNAseq data into actionable insights.

Regeneron, a leading biotech company committed to creating therapeutic innovations, has built one of the world’s most comprehensive genetics databases with over 500,000 exomes. On their journey to turning this data into novel therapeutic insights, Regeneron encountered numerous challenges. For example, how do you enable fast and accurate queries from >300B data points? And how do you expedite novel statistical tests on TB-scale data?

In this session, Regeneron will share the challenges they faced building one of the world’s largest genetics databases, how they overcame these challenges with a scalable and performant informatics infrastructure powered by Apache Spark™, Databricks and AWS, and the key lessons learned along the way.

Join this webinar to learn:

About the role genomics plays in accelerating drug development at Regeneron
What challenges they faced turning 500k exomes and electronic medical records into actionable insights
How Apache Spark, Databricks and AWS enable them to easily scale informatics and improve query speeds by 600x
Demo on a machine learning model for genome-wide disease risk scoring powered by Apache Spark and Databricks

Speakers:

Dr. Lukas Habegger

Associate Director of Bioinformatics

Regeneron Genetics Center

Dr. Lukas Habegger is the Associate Director of Bioinformatics at the Regeneron Genetics Center (RGC), one of the most productive sequencing efforts in the world. Lukas manages the Genome Informatics R&D Team, which develops new algorithms to analyze genomic and clinical data. He is spearheading a project to build out the RGC’s big data infrastructure and create a cutting-edge Apache Spark data analysis platform that integrates clinical and genomic data and provides advanced query and analytical capabilities. He received his undergraduate degrees in Bioinformatics and Statistics from the Rochester Institute of Technology and obtained his Ph.D. in Computational Biology & Bioinformatics from Yale University.

Frank Nothaft

Technical Director of Healthcare and Life Sciences

Databricks

Frank is the Technical Director for the Healthcare and Life Sciences vertical at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM project at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips. Frank holds a PhD and a Master of Science in Computer Science from UC Berkeley, and a Bachelor of Science with Honors in Electrical Engineering from Stanford University.
