Take the DNA Delorean: The Promise of Large Language Models in Genomics

Contributed Commentary by Vanessa Braunstein, NVIDIA

January 27, 2023 | Elucidating information directly from DNA to understand RNA's and human proteins' downstream structure and function is the holy grail of nearly a century of molecular biology research. The central dogma of the field—that DNA is transcribed into RNA before translation into protein—describes in clear linguistic terms the way our cells function.

Today, large language models (LLMs) such as GPT-3 and Megatron Turing NLG demonstrate remarkable abilities to interpret, translate, summarize, and generate human languages. Next-generation sequencing has propelled us from the $3B first human whole genome sequence to a reality in which companies like Ultima tout AI-powered sequencers capable of whole-genome sequencing at $100 per individual. Now, we are finally generating the amount of data required to train such models in the language of biology.

This drop in the cost of sequencing has already allowed us to discover genes and variants associated with specific, rare, chronic, or cancerous diseases. Fast, accurate deep learning models for detecting variants have been developed with immediate applications in critical care settings such as neonatal intensive care units. DeepVariant, for example, was used in the Stanford-led effort, carried out in a critical care setting, that set the Guinness World Record for the fastest DNA sequencing technique. What does this confluence of affordable sequencing, accelerated computing, and powerful AI architectures mean for biology today?

Today, genomic instrument companies such as Oxford Nanopore Technologies, PacBio, Singular, and Ultima have publicly announced that they use graphics processing units (GPUs) inside their sequencing platforms for AI-based base calling. The models span CNN, RNN, and transformer architectures, including DeepConsensus in PacBio’s instruments, which uses gap-aware sequence transformers to correct errors and improve read accuracy (Nature Biotechnology, DOI: 10.1038/s41587-022-01435-7).

AI has helped accelerate variant calling, variant filtering, and base calling in genomic instruments and analysis, but what about other areas that involve prediction? LLMs are AI models built on the transformer architecture, and their application to DNA, RNA, and proteins is a burgeoning field in genomics. The transformer architecture, which uses self-attention to weight the significance of each part of the input data differently, has revolutionized natural language processing (NLP) and computer vision.
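To make the idea of self-attention concrete, here is a minimal sketch in plain Python. It is not taken from any of the models discussed here. It assumes identity projections (queries, keys, and values are the input vectors themselves) and a hypothetical one-hot embedding for the four nucleotides, whereas real transformers learn these projections and embeddings during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption for illustration: queries, keys, and values are the input
    vectors themselves. Returns (outputs, weights), where weights[i][j]
    is how strongly position i attends to position j.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, vectors)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical one-hot embeddings, used only for this toy example.
EMBED = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}

sequence = "ACGA"
outputs, weights = self_attention([EMBED[n] for n in sequence])
for nucleotide, row in zip(sequence, weights):
    print(nucleotide, [round(w, 3) for w in row])
```

In this toy setup, the first A attends equally to itself and to the final A, and less to C and G, which shows how attention weights positions differently depending on their content.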

Just as LLMs in NLP understand words, LLMs in genomics understand DNA represented as text sequences of its components (i.e., the nucleotides A, T, C, and G). These genomic LLMs are also called DNA language models or nucleotide language models.
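As a rough illustration of treating DNA as text, the sketch below splits a sequence into overlapping k-mers and maps each one to an integer id, much as an NLP tokenizer maps words to ids. Overlapping k-mers are only one possible tokenization scheme. The function names, the choice of k = 3, and the example sequence are assumptions made for this example and do not describe how any specific model named in this article works.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def build_vocab(k):
    """Map every possible k-mer over A, C, G, T to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=k))}


def tokenize(sequence, k, stride=1):
    """Split a DNA string into k-mers taken every `stride` positions."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


vocab = build_vocab(3)             # 4**3 = 64 possible 3-mers
tokens = tokenize("ATGCGT", k=3)   # ['ATG', 'TGC', 'GCG', 'CGT']
token_ids = [vocab[t] for t in tokens]

print(len(vocab), tokens, token_ids)
```

The vocabulary grows as 4 to the power k, so larger k-mers trade a bigger vocabulary for fewer, more informative tokens. This sketch does not handle ambiguous bases such as N, which would need their own vocabulary entries.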

LLMs in genomics are trained on a large corpus of labeled genomic data and have demonstrated the ability to accurately predict pathogenic variants and gene expression. While protein LLMs have been heavily investigated for their ability to understand protein structure and function, LLMs applied to DNA are still getting started. They offer scientists new ways to interpret non-coding DNA regions, predict variants of viruses (e.g., for prioritization during vaccine development), and better understand differential gene expression between healthy and cancer tissues.

Proteins have a vocabulary of 20 amino acids and an average sequence length of about 350 amino acids. Genomic LLMs, by contrast, operate on a vocabulary of four nucleotides and very long sequences: the haploid human genome is three billion nucleotide pairs. These sequences include important long-range dependencies, where a change at the beginning of the genome may influence behavior in the middle, so applying LLMs to them requires significant domain adaptation.
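The difference in scale can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses the figures quoted above and assumes naive full self-attention with one token per residue or nucleotide, in which every position is compared with every other position. That assumption is for illustration only: models in practice work on much shorter windows or use other strategies rather than attending over a whole genome at once.

```python
PROTEIN_LENGTH = 350            # average protein length in amino acids
GENOME_LENGTH = 3_000_000_000   # haploid human genome in nucleotide pairs


def attention_pairs(n_tokens):
    """Number of pairwise scores in one full self-attention matrix."""
    return n_tokens * n_tokens


length_ratio = GENOME_LENGTH / PROTEIN_LENGTH
protein_pairs = attention_pairs(PROTEIN_LENGTH)
genome_pairs = attention_pairs(GENOME_LENGTH)

print(f"Genome is about {length_ratio:,.0f}x longer than an average protein")
print(f"Attention scores for one protein: {protein_pairs:,}")
print(f"Attention scores for a whole genome: {genome_pairs:,}")
```

Because the number of pairwise scores grows with the square of the sequence length, a sequence several million times longer implies trillions of times more comparisons, which is one reason long-range dependencies in DNA are hard to model directly.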

Genomic LLMs Show Promise As Interpretive and Generative AI

At the 2022 Supercomputing conference (SC22), we shared the Gordon Bell special award with more than two dozen academic and commercial researchers from Argonne National Laboratory, the University of Chicago, and others. The honored work was a genomic LLM that tracks genetic mutations and predicts variants of concern in SARS-CoV-2, the virus behind COVID-19. With anywhere from 2.5 to 25 billion trainable parameters, the genome-scale language models (GenSLMs) represent some of the first and largest whole-genome LLMs, trained on over 100 million nucleotide sequences (bioRxiv, DOI: 10.1101/2022.10.10.511571).

Instead of focusing on specific proteins from single genes, GenSLMs learn the entire genome to model individual mutations at the nucleotide scale, thus implicitly accounting for protein-level mutations. GenSLMs are also generative AI models for genomics. When thinking of LLMs in genomics, it is important to consider the scale of datasets and genomes compared to the faint signal of small variations that significantly impact individuals.

Today, advances in LLMs are coming out almost weekly in natural language, proteomics, and generative chemistry. Genomic LLMs are starting to grow and show promise both for interpretation and for generative AI that creates synthetic data for single nucleotide variants, insertions/deletions, copy number variants, structural variants, and fusion calling.

In September 2022, Nature featured on its cover a deep generative model focusing on regulatory DNA and predictions of the lowest and highest levels of expression in yeast (DOI: 10.1038/s41586-022-04506-6). Using millions of randomly sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, a deep neural network model was designed to “predict which regulatory mutations affect expression and fitness (when coupled with expression-to-fitness curves), design or evolve new sequences with desired characteristics, determine how quickly selection achieves an expression optimum, identify signatures of selective pressures on extant regulatory sequences, visualize fitness landscapes and characterize mutational robustness and evolvability.”

Transformer LLM for Genomic Enhancers

Enformer, released in 2021, is a deep learning model with a transformer architecture for genomic enhancers that predicts gene expression from DNA sequences and can integrate information from long-range interactions in the genome (Nature Methods, DOI: 10.1038/s41592-021-01252-x). This model helps scientists understand how noncoding DNA makes decisions about gene expression in different cell types, such as skin, liver, and heart cells, among others. Genomic models that predict gene expression and chromatin states from DNA sequences hold the promise of a better understanding of the transcriptional regulation of messenger RNA (mRNA) from DNA, which is an essential step before proteins are produced.

Additionally, understanding how genetic variants in human diseases and traits impact expression is an essential tool for synthetic biology and the treatment of disease by gene therapy. Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input, meaning researchers can supplement or even replace expensive experimental data with Enformer predictions in some cases.

LLMs for scRNA-Seq Data and Nucleotide Data

scBERT—released in September 2022—is another groundbreaking genomic LLM that understands gene-gene interactions and is trained on large corpora of unlabeled scRNA-Seq data (Nature Machine Intelligence, DOI: 10.1038/s42256-022-00534-z). scBERT helps with cell type annotation and novel cell type discovery. Annotating cell types is often a prerequisite for tumor analysis and disease progression projects.

DNABERT, released in 2021, is another genomic LLM that understands nucleotide sequences and can make downstream predictions of promoters, splice sites, and transcription factor binding sites (Bioinformatics, DOI: 10.1093/bioinformatics/btab083). Additionally, DNABERT offers interpretability that helps researchers identify conserved sequence motifs and functional genetic variants.

Genomic LLMs are a novel area showing much promise for understanding genes, variants, coding and noncoding regions, gene expression, methylation, thermostability, and more. Their rise has been enabled by the confluence of an exponential drop in sequencing prices, significantly more powerful accelerated computing architectures, and the development of attention mechanisms and new architectures in the field of artificial intelligence. While they build on many years of deep learning research in biology, they promise new capabilities, particularly for generating synthetic constructs in silico and predicting sequence properties with experimental-level accuracy.

As we generate more genomic data and train larger models on more powerful hardware, we can likely expect the capabilities of such models to continue improving. The ability of LLMs to outperform both traditional algorithms and previous deep learning architectures on tasks across an array of fields highlights their importance today and their role as a potential basis for the models of tomorrow.

Vanessa Braunstein leads healthcare and life science product marketing for NVIDIA’s Clara products in drug discovery, genomics, medical imaging, medical devices, NLP, and smart hospitals. Previously, she was in product development, business development, and marketing for radiology, genomics, pharmaceutical, chemistry, and bioinformatics companies using AI. She studied molecular and cell biology, public health, and business at UC Berkeley and UCLA. She can be reached at vbraunstein@nvidia.com.