Causation in Context: Brendan Frey's Vision for Deep Genomics

August 21, 2015

By Allison Proffitt

Last month, Brendan Frey got a flurry of media attention as he and his cofounders formally launched Deep Genomics, a company whose mission is to combine deep learning with genomics to model the body.

By the time Frey and I got a chance to discuss the venture, his company-genesis spiel was perfected, rolling off the tongue in one 1,200-word monologue.

But Frey isn’t a pitchman. His background is in machine learning, specifically computer vision and speech processing. About twelve years ago, Frey started asking questions about genomics and was dissatisfied with the answers. As he dug into the literature, he found that genome-wide association studies (GWAS) were common, and the methodology struck him as particularly “ill-conceived.”

“They address questions which I think are kind of superficial,” he explained. “You try to just correlate mutations with diseases. It doesn’t get at the causation. It doesn’t get at why the mutation is a problem.”

Frey directed his lab at the University of Toronto to apply machine learning techniques to the genome: to read the “text” and predict the phenotypic outcomes. It would take a strong foundation in machine learning and an equally deep understanding of genome and cell biology: “to understand how DNA wraps around nucleosomes like chromatin. Understanding how RNA secondary structures appear. Understanding how proteins bind to RNA, how proteins bind to DNA. Understanding all of the biological components and cellular processes that underlie cell biology.”

Frey equipped his lab with both skillsets and started publishing: 47 papers since 2003 in Nature, Science, Nature Biotechnology, Nature Genetics, Bioinformatics, and more, only two of which were GWAS studies.

But Frey soon realized it wasn’t enough to just publish the studies. “About a year ago my cofounders and I decided that if we really wanted to change medicine we needed to make a company.” Frey hired many of his University of Toronto lab group, and Deep Genomics was born.

Shallow Learning

The “deep learning” that serves as the foundation of the company’s vision is a rich computational dive into how the cell works. Frey contrasts it with the most popular approaches, which he calls “shallow learning.”

“If you look at the most popular techniques, like, for example, 23andMe, the method it used to try to identify mutations could be problematic. It’s very simple. It’s shallow learning, if you like. It’s just basically correlation analysis,” Frey says of the process of looking at an individual point mutation and correlating that with a disease outcome.

The body is much more complex than that, Frey argues. The effect of mutations depends on context.

“For example,” Frey offers, “the effect of a mutation might depend on whether or not there’s RNA secondary structure... Like maybe the functional impact of a mutation is that it disrupts the ability of the protein to bind. However, in one person you might have a particular RNA secondary structure that’s present, whereas in the other person you don’t have RNA secondary structure, and the presence of that secondary structure influences whether or not the protein binds… The very same mutation can have a different effect in another person.”

What Frey calls shallow learning methods can’t tease that effect apart. Deep learning, he says, combines layers of information computationally to form increasingly complex models and predicts biological outcomes from them.
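The difference can be made concrete with a toy calculation. The sketch below is not Deep Genomics’ method. It uses an invented cohort (the counts, the `COHORT` table, and the helper names are all assumptions for illustration) in which a mutation raises disease risk only when an RNA secondary structure is present. It computes an odds ratio, the kind of single-variant association statistic a correlation-based analysis relies on, first pooled across everyone and then split by context.

```python
from collections import Counter


def odds_ratio(records):
    """Odds ratio of disease given mutation, from (mutated, diseased) pairs.

    Adds 0.5 to every cell (Haldane-Anscombe correction) so that an empty
    cell cannot cause a division by zero.
    """
    counts = Counter((bool(m), bool(d)) for m, d in records)
    a = counts[(True, True)] + 0.5    # mutated, diseased
    b = counts[(True, False)] + 0.5   # mutated, healthy
    c = counts[(False, True)] + 0.5   # not mutated, diseased
    d = counts[(False, False)] + 0.5  # not mutated, healthy
    return (a * d) / (b * c)


# Hypothetical cohort, invented for this example:
# (has_mutation, has_secondary_structure, diseased, number_of_people)
COHORT = [
    (True, True, True, 18), (True, True, False, 2),
    (True, False, True, 2), (True, False, False, 18),
    (False, True, True, 2), (False, True, False, 18),
    (False, False, True, 2), (False, False, False, 18),
]


def expand(rows, structure=None):
    """Turn count rows into (mutated, diseased) pairs, optionally keeping
    only people with (True) or without (False) the secondary structure."""
    pairs = []
    for mutated, has_structure, diseased, n in rows:
        if structure is None or has_structure == structure:
            pairs.extend([(mutated, diseased)] * n)
    return pairs


pooled = odds_ratio(expand(COHORT))
with_structure = odds_ratio(expand(COHORT, structure=True))
without_structure = odds_ratio(expand(COHORT, structure=False))

print(f"pooled: {pooled:.1f}")
print(f"with structure: {with_structure:.1f}")
print(f"without structure: {without_structure:.1f}")
```

In this made-up cohort the pooled odds ratio comes out near 8, while the stratified values are about 55 with the structure and 1 without it. The pooled number describes neither group, which is the kind of context dependence Frey is pointing to.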

“Deep learning enables you to combine a lot of information together from the sequence, a lot of contextual information, to make a prediction that really is context-dependent,” he says. “That enables our system to take into account context.”

Frey’s Law of Small Numbers

Because Frey isn’t relying on correlation, he isn’t looking for large numbers of samples. In fact, he’s just as skeptical of the race to millions of genomes as he was of GWAS studies.

“Genome sequencing is becoming cheaper and cheaper and cheaper. And so, people say, let’s just do more of it… Just because it’s cheaper doesn’t mean you should do more of it. But that seems to be what people are doing… because the only tool they have is correlation. So their conclusion is sequence more genomes. Sequence lots of genomes and maybe correlation will work. Yeah, I think it’s misguided and it’s just because people don’t know what else to do.”

Earlier this year, Frey’s group published a Science paper on a machine-learning technique that scores how strongly genetic variants affect RNA splicing (doi: 10.1126/science.1254806). The work was based on the sequences of five autism genomes.

When Frey talks about that research with other people in the autism community, “they’re surprised,” he says. They think: “‘You should need thousands of genomes to do that kind of thing.’ But the point is that if you can understand the text of the genome, which our system can do, then you don’t need that much information.”

Although he’s skeptical of comparing millions of sequenced genomes, that doesn’t mean that Frey isn’t using—and doesn’t need—a lot of data.

“You can think of our computational system as a cell simulator, if you like. It’s kind of like mimicking what goes on in the cell. So given DNA, our computational system can tell us what we’re going to see within the cell. Of course it’s not perfect. It makes some errors, but so do experimental techniques. So we’re constantly in the process of comparing our predictions against experimental data and looking at the discrepancies, and seeing what’s going on.

“Quite often we find that the experimental data is wrong and our predictions from our cell simulator are correct. And then sometimes we find the opposite: that our predictions are incorrect and the experimental data is correct. So there’s sort of this iterative process within the company of refining the models, refining our computational system, and comparing it against data.”

How exactly that works is a trade secret, but the company is continually feeding new information into its program from internal research, published literature, and the proprietary datasets of its partners that reveal causal relationships within the cell.

“We know for certain that the genetic sequences in DNA are crucial for generating molecules, like RNA and proteins,” Frey says. “There’s no doubt about that, it’s a causal relationship. Our approach is to develop machine learning techniques that can mimic these causal processes that are well-understood and that way we can really trace back the disease to causal factors in the DNA.”

“For example: chromatin interaction where one piece of chromatin sticks to another piece of chromatin. There will be a new dataset on that to show that this happens,” Frey says. The next question is, “Can we train a machine learning model to take the DNA as input and project when there’s going to be a chromatin interaction? To do that we need a dataset. So we need a dataset of chromatin interaction, so we’d use whatever data is available for that purpose. Or in some cases in collaboration with our partners we would obtain experimental data that would allow us to do that.”

Even then, Frey’s system doesn’t demand whole genomes or even whole exomes to make predictions. Deep Genomics trains the system by giving it short pieces of DNA and corresponding protein measurements within the cell, Frey explains.

He compares the process to teaching a child to read using short words and their corresponding pictures. The child starts by matching words to pictures, but eventually the child learns to read and can imagine her own pictures to match any series of words.

“We’re showing our computer system a whole bunch of examples of short pieces of DNA, along with the corresponding measurements of what’s going on in the cell. And we show it hundreds of thousands of examples like that and then the system learns to figure out how DNA generates these pictures—or these protein levels—within the cell. And the same amazing thing happens. Once we’ve done that, we can now give our system a piece of DNA and it can predict what the picture looks like, if you like. It can predict what’s going to happen to the proteins in the cell. And also it can do it for DNA it’s never seen before.”
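The general idea of learning from short sequences paired with measurements, then predicting for sequences never seen in training, can be sketched in a few lines. What follows is a deliberately simple stand-in, not the company’s system: a linear model on one-hot encoded bases, fit by stochastic gradient descent, rather than a deep network. The four-base fragments, the `simulated_measurement` function that plays the role of lab data, and every name in the block are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
K = 4  # length of each short DNA fragment (toy value)


def one_hot(seq):
    """Encode a DNA string as a flat 0/1 vector, led by a constant bias term."""
    vec = [1.0]
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def simulated_measurement(seq):
    """Hypothetical stand-in for a measured protein level (invented rule)."""
    effect = {"A": 0.0, "C": 1.0, "G": -0.5, "T": 2.0}
    return sum((i + 1) * effect[base] for i, base in enumerate(seq))


def predict(weights, seq):
    """Predicted level: dot product of the weights with the encoded sequence."""
    return sum(w * x for w, x in zip(weights, one_hot(seq)))


def train(examples, epochs=100, lr=0.05):
    """Fit a linear model by stochastic gradient descent on squared error."""
    weights = [0.0] * (1 + K * len(BASES))
    for _ in range(epochs):
        for seq, target in examples:
            x = one_hot(seq)
            error = sum(w * xi for w, xi in zip(weights, x)) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


all_seqs = ["".join(p) for p in product(BASES, repeat=K)]
held_out = all_seqs[::5]          # every fifth fragment is never shown in training
held_out_set = set(held_out)
training = [(s, simulated_measurement(s)) for s in all_seqs if s not in held_out_set]

weights = train(training)

# Worst prediction error on fragments the model has never seen.
max_error = max(abs(predict(weights, s) - simulated_measurement(s)) for s in held_out)

# Predicted effect of a single-base change; the reference "AAAA" was held out.
reference, mutant = "AAAA", "AATA"
predicted_change = predict(weights, mutant) - predict(weights, reference)
print(f"max held-out error: {max_error:.6f}")
print(f"predicted change for {reference} -> {mutant}: {predicted_change:.3f}")
```

The toy works only because the invented measurement rule is itself linear. Real cellular measurements are noisy and context-dependent, which is the case Frey makes for deeper models.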

When the system encounters a mutation, it doesn’t need to have seen the mutation to recognize that it will change the text and thus change the protein level in the cell. In fact, Frey says the system doesn’t need any disease information for the patient. Linking the protein levels in a cell to the disease is “a secondary step that actually is pretty easy to do.”

Knowing What We Don’t Know

It’s an elegant vision: read the “text” of the DNA and understand the inner workings of the cell. But the whole idea could be dismissed as laughably reductionist.  

Frey acknowledges that there is much ground to cover. The company has a scientific and technological roadmap, he says, for incorporating new information.

“The roadmap is basically to incorporate biological processes one at a time as we understand them better, so polyadenylation and enhancers, transcription, RNA stability, translation, protein localization within the cell, a chromatin interaction. So there’s all sorts of different cell processes and we have a plan to incorporate those into our system as time goes in the next couple of years.”

The data needed are vast. Even so, Frey seems genuinely flummoxed that his approach is a new one.

“What I’ve concluded is that my team has this unique expertise in machine learning and genome biology that allows us to combine the two together and have this unique framework for building this computational system. And what I’ve realized is that—surprisingly—other labs, they’ve kind of given up. They’re taking the shallow learning approach, and because they’re using shallow learning they can’t really figure out how the language is complex and how it works.”

He’s not having too much trouble making converts.

Yann LeCun, director of Artificial Intelligence at Facebook, admitted to the Washington Post that the work thus far was a “baby step,” yet he joined the Deep Genomics Scientific Advisory Board after hearing Frey speak last December. Also on the SAB: Stephen Scherer, director of The Centre for Applied Genomics, The Hospital for Sick Children; and Jordan Lerner-Ellis, director of molecular diagnostics at Mount Sinai Hospital.

Frey believes he’s converting other technology visionaries as well. “I met with the executive team at Illumina last week and they’re completely onboard with what we’re doing. I met with people at Google Life Sciences and they get it,” he said.

He expects the field to become more crowded soon. “There’s a sea change coming,” he predicts. “People are going to be focusing now on the machine learning component and trying to understand what the genome means, not just sequence a bunch of genomes.”

First Target

In the future, Frey believes his system will play a role in pharmaceutical development, creating patient networks, and precision medicine. But for now, Frey has his sights set on the genetic testing market.

He likens current genetic tests to glorified literature searches. Mutations are found, and then compared to the published lists. If it’s a known mutation, the standard of care is prescribed. But if it’s a novel mutation, there’s a lot of uncertainty. Frey says Deep Genomics can offer its software as a service for genetic testing companies to illuminate these unknown mutations.

“Our system would be able to say, this mutation’s new. It’s never been seen before in a cancer patient. However, it causes a change in a protein level that’s just the same as all these other mutations that have been linked with cancer,” Frey says. “So we can say therefore this is evidence that there’s a similarity in mechanism. Or our system can say, well there’s this new mutation, but our system predicts that it doesn’t cause any changes in protein levels that have been seen before in these patients with cancer.”
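One way to picture that comparison is as a similarity search over predicted effects. The sketch below is an assumption about how such a check might look, not a description of the Deep Genomics service: each mutation is represented by a hypothetical vector of predicted changes in a few protein levels, and a novel mutation is compared with known disease-linked ones by cosine similarity. The variant names, the numbers, and the 0.9 threshold are all invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical predicted changes in three protein levels (arbitrary units)
# for mutations already linked to a disease in the literature.
KNOWN_DISEASE_LINKED = {
    "known_variant_1": [-0.9, 0.1, 0.4],
    "known_variant_2": [-0.8, 0.0, 0.5],
}


def closest_known(profile, known=KNOWN_DISEASE_LINKED):
    """Return the most similar known variant and its similarity score."""
    name = max(known, key=lambda n: cosine_similarity(profile, known[n]))
    return name, cosine_similarity(profile, known[name])


def triage(profile, threshold=0.9):
    """Report whether a novel mutation's predicted effect resembles a known one."""
    name, score = closest_known(profile)
    if score >= threshold:
        return f"predicted effect resembles {name}"
    return "no similarity to known disease-linked effects"


novel_a = [-0.85, 0.05, 0.45]  # invented profile close to the known ones
novel_b = [0.10, 0.90, -0.10]  # invented profile unlike the known ones
print(triage(novel_a))
print(triage(novel_b))
```

A high score would count only as evidence of a shared mechanism, in Frey’s framing, not as a diagnosis.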

Meanwhile, Frey is not worried about competing for funds. Deep Genomics already has revenue streams. Frey lists Human Longevity as a client and names SynapDx and The Centre for Applied Genomics in Toronto as “partners.” With its initial angel investments and an increasing revenue stream, Frey believes, the company is well positioned for at least the next six months.

“A lot of people want to put money into the company. And right now actually I have been just sort of holding back on them,” Frey says. “You know, I’m sort of telling them that I’ll get back to them soon just because we have enough money to do what we want to do.”