Arpeggi’s Harmonious Approach to NGS Data Analysis

April 8, 2013 | While there has been considerable focus of late on new bioinformatics approaches and platforms for the downstream interpretation of genome sequences in a clinical context, there is still a lot of work to be done to improve the quality and consistency of next-gen sequencing read alignment and variant calling. Although this is the subject of significant academic pursuit, the problem is also attracting interest from new software start-ups. One is Austin, Texas-based Arpeggi, co-founded by CEO Nir Leibovich, geneticist and chief science advisor David Mittelman, and chief technical officer Jason Wang. What makes Arpeggi stand out is a desire to interface with the community to help establish standards and best practices for NGS data analysis. The firm is setting up a website, Bioplanet, to share tools and resources. 

Bio-IT World editor Kevin Davies recently asked Leibovich to outline his company’s core strengths and business proposition. 

Bio-IT World: Nir, what is the thinking behind Arpeggi?
There’s obviously been growing excitement in recent years as sequencing has become easier, faster, and cheaper. In fact, it is so cheap now that, for the first time, data analysis seems harder than data generation. At Arpeggi, we want to streamline the analysis of next-gen sequencing (NGS) data. The field is still fairly new, there is substantial heterogeneity in sequencing platforms and applications, and there is a growing number of tools for more and more analysis steps. All of this complexity limits the pace of innovation in sequence analysis, and that is before we even get to the traditional big-data problems of throughput, data size, and so on.

In a nutshell, we are a team of data analytics experts, software developers, and life scientists determined to simplify, enhance, and innovate the analysis of genomes.

Where does the name ‘Arpeggi’ come from? 

One of our co-founders, David Mittelman, really loves music. He has a tremendous passion for it and he once related DNA sequencing to the performance of musical notes in an arpeggio. Turns out he is also a big fan of Radiohead and was jamming to a track off “In Rainbows” when we were discussing our initial formation. So “Arpeggi” came up when we thought of company names and it stuck with us!

So the firm could have been called “OK Computer”? 

{Laughs} I’d like to think we are going to be the "Kid A" of genomics and move things in new and exciting directions.

Speaking of music, Austin isn’t exactly known as a bioinformatics hotbed. How is it as a start-up environment?  

Austin is a wonderful place to live and has been great for my previous start-up as well as for Arpeggi. You can get a great deal on office space, and there is a great talent pool coming out of UT [University of Texas] and the surrounding institutions. There is also a lot of entrepreneurial energy here, particularly when SXSW takes over the city. Growing Austin into a biotech hub will take some time, but we do have some anchors to build around like Life Technologies, Agilent, Asuragen, Amgen, and Luminex. Thanks to a recently passed bill, UT is now going to build its own medical school. High-throughput sequencing and bioinformatics are part of the plan, so this will only grow the industry here further.

What is your own business background and how did you get into genomics? 

I’m a serial entrepreneur. My past ventures revolve around the common theme of data analytics. I’m fascinated with empowering solutions and insight through data aggregation and analysis. My first venture, as a freshman in college, was a data-driven online marketing platform. This was in the early days of the web, and it anticipated the Google Analytics-style tools that would follow years later. In 1999, I co-founded a stock trading analytics and training platform that grew to hundreds of thousands of members and earned a weekly spot on CNBC.

In 2005, I co-founded MarketZero, where we built a data mining and analytics platform that was purchased by Zynga in 2011. At Zynga, I navigated true big-data analytics, made up of the game actions of over 250 million players, to help lead new product efforts.

Late last year, I became fascinated with a new opportunity following several conversations with my friend David Mittelman. He detailed the state of the NGS market and the exciting analysis technology he had developed in his lab at Virginia Tech. I couldn’t sleep for days; I was wired thinking about the potential impact of this newfound data source. So I resigned from Zynga, and Arpeggi was born!

Who are the other co-founders at Arpeggi? 

The other co-founder (besides myself and David) is Jason Wang, a big data analytics expert. We co-founded MarketZero and also worked together at Zynga. He pioneered lots of data-driven metrics at both companies and has lots of experience handling enormous datasets. He managed a large team of engineers at Zynga and currently manages an amazing group of engineers at Arpeggi.

Mittelman is an associate professor at Virginia Tech, where he runs a combined wet-lab and computational genomics group. David and I have been lifelong friends, and he has been teaching me genomics since his involvement in the Human Genome Project at UT Southwestern. He got his PhD at Baylor College of Medicine, where he trained with Richard Gibbs at BCM’s genome center. It is safe to say David has been immersed in genetics and genomics his entire working life.

Who is your competition and what sets Arpeggi apart? 

The genome analysis market is very fragmented at the moment. There are numerous companies in the space, albeit focusing on different parts of the analysis process. Initially we were concerned that everyone was a competitor, but instead we found that we have far more potential partners and collaborators than competitors. Great companies such as Appistry, Bina, and Knome offer appliances that could benefit from our software technology. Other companies such as Seven Bridges Genomics, Spiral Genetics, and DNAnexus offer great cloud-based workflow solutions. Ultimately, all of these companies want to serve the best tools for analyzing genomic data. Right now they mostly leverage open-source tools (or tools that were previously open source, like GATK). We’re determined to emerge as a preferred solution for these companies and future ones, delivering faster, more dependable, and more accurate results from the same data.

Our primary focus is on extracting the most we can from genomic datasets by using an integrated approach that scales really well in both local and cloud environments. So I see an opportunity for us to bring value to these companies by providing them with more efficient tools that can, of course, run side by side with other analysis tools. We also think that the output of our tools makes a great input for companies that maintain curated databases and variant interpretation tools.

The existing variant callers and aligners aren’t perfect, but is there a viable business model here? 

There are lots of great open-source and commercial tools for mapping short reads and identifying variants, but there is room for improvement. Current pipelines involve many steps, often requiring many tools. This creates inefficiencies, since data must be written out and read back in at each step, and it can limit accuracy, because each tool must render an interpretation at a given step and pass a final answer to the next. There are also challenges simply in trying to pick the right tool and configuration for each step. Our idea was to create an integrated variant caller that accepts the reads from a sequencer and produces variant calls in two related steps: first a genome reconstruction step, and second, a variant-calling step. During reconstruction, we map reads, realign them, and in some places apply local assembly. This is all done in one integrated step, and I think it works better because we apply multiple approaches and use the data from each approach to refine read placement on the reference. Then we can easily call variants, and of course we can simply export the read alignment if you want to use third-party tools as well. This integrated design also minimizes disk I/O and makes the most of compute resources by scaling better across distributed hardware.
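In miniature, the reconstruct-then-call flow Leibovich describes might look like the following toy Python sketch. This is purely illustrative and is not Arpeggi's actual method; every name and the matching scheme here are assumptions. Reads are placed at the reference offset where they disagree least, a per-position pileup is built, and variants are called wherever a well-covered consensus differs from the reference:

```python
from collections import Counter

def reconstruct(reference, reads):
    """Step 1 (toy): place each read at the reference offset where it
    disagrees least, and build a per-position pileup of observed bases."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        offsets = range(len(reference) - len(read) + 1)
        best = min(offsets, key=lambda i: sum(
            a != b for a, b in zip(read, reference[i:i + len(read)])))
        for j, base in enumerate(read):
            pileup[best + j][base] += 1
    return pileup

def call_variants(reference, pileup, min_depth=2):
    """Step 2: report a variant wherever the consensus base at a
    sufficiently covered position differs from the reference."""
    variants = []
    for pos, counts in enumerate(pileup):
        if sum(counts.values()) >= min_depth:
            consensus = counts.most_common(1)[0][0]
            if consensus != reference[pos]:
                variants.append((pos, reference[pos], consensus))
    return variants

reference = "GATTACAGCT"
reads = ["GATTA", "TTATA", "ATAGC", "TAGCT"]  # sample carries a C->T change at position 5
print(call_variants(reference, reconstruct(reference, reads)))
# [(5, 'C', 'T')]
```

A real implementation would add paired-end information, indel realignment, local assembly, and quality-aware genotype likelihoods. The point of the sketch is only the shape of the two-step design: no intermediate files are written between reconstruction and calling.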

Is your software primarily designed for basic research or more clinical environments? 

We’re excited to work with both basic research groups and the clinic. We want to engage genomics groups to vet our tools and to see where we excel and where there is room for improvement. I think our tools will bring value to basic research, particularly for domain scientists who might not be experts in genome analysis. They also bring value to large projects that require the analysis of massive datasets. Our tools are designed for scale, and this makes the process more automated, fast, and cost-effective both locally and on the cloud. For the clinic, we bring a lot of commercial-grade features, like audit trails and controlled releases, that are critical in those environments but are not real focus areas of current open-source software.

What is the GCAT Project and how can the bioinformatics/genomics community get involved? 

GCAT (Genome Comparison & Analytic Testing) is meant to be a collaborative platform for comparing multiple genome analysis tools across different sets of metrics. The exact metrics and data sets are crowdsourced to encourage community involvement and input. The GCAT platform features an easy interface and automatically generates compelling visualizations of benchmark and performance testing data.
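At its core, that kind of comparison reduces to scoring each tool's call set against a trusted truth set. The following minimal Python sketch shows the idea; it is illustrative only (this is not GCAT's code, and GCAT's actual metrics, data, and report format may differ):

```python
def benchmark(calls, truth):
    """Score one tool's variant calls against a truth set, returning the
    kind of summary metrics a benchmarking report might show."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # true positives: correctly called variants
    fp = len(calls - truth)   # false positives: called, but not in the truth set
    fn = len(truth - calls)   # false negatives: real variants that were missed
    return {
        "sensitivity": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if calls else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }

# Variants as (chromosome, position, ref, alt) tuples -- hypothetical data.
truth  = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
tool_a = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 99, "T", "C")}
print(benchmark(tool_a, truth))
# sensitivity and precision are each 2/3; one false positive, one false negative
```

Running this for several tools over the same crowdsourced datasets, and charting the resulting metrics, is essentially the comparison GCAT automates.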

It came about initially as a tool we built for internal benchmark testing throughout the company’s product development. However, we’re firm believers that everyone gains from the advancement and standardization of NGS analysis. We felt it was mutually beneficial to share this internal tool with the NGS community as a framework to encourage healthy and productive discussions around challenging topics, and to save the community the substantial time and resources of conducting such tests manually, as they have been.

If we can agree on standards, we can get more people into this space, and that benefits everyone. We’ve enlisted the help of an advisory board made up of thought leaders who will help us curate the feedback and rank the priorities to stay focused. We will engage the community online, and we are set to present ongoing results and insights at a couple of upcoming conferences, including The Clinical Genome Conference that you’re organizing in June.

The GCAT project is hosted on a website called Bioplanet. What do you hope to accomplish there? 

Bioplanet was actually a site we acquired from a bioinformatician who started it over ten years ago with hopes of making it a fun place for bioinformaticians to connect, post jobs, etc. Our vision is very much a continuation of the same. We want a welcoming environment for all members of biotech to connect. We’ll continue to provide useful resources and tools like GCAT to aid and empower the community. I can’t give any timelines but we would love to expand the discussion beyond the technical to policy and ethical issues, as well as regulatory issues.

I hear you’ve just been selected for GE’s Entrepreneurship Program. What do you hope to achieve over the course of the 3-year program? 

We just got back from New York, where we met with GE executives to discuss just that. We’re absolutely thrilled to be selected for this prestigious program and to be part of GE’s ‘Healthymagination’ initiative, a $6 billion global commitment to provide better health for people by improving the quality, access, and affordability of care. We’re hoping to play a large role in that vision by leveraging our technology as a key building block for many future consumer-facing products we can’t yet reveal.

We’ve been very impressed with GE’s out-of-the-box thinking, their interest in genomics, and their genuine dedication to transforming health care. We are convinced that genome analysis, and the multi-functional insight we can gain from it, will be an instrumental ingredient in that transformation, especially around precision medicine, rare genetic diseases, and preventative care.