Discovery To Creation: Highlights from the 2023 Single-Cell and AI in Medicine Symposium

August 2, 2023

By Randall C. Willis with review and comments from Fabrice Chouraqui, Laurens Kruidenier, Chad Nusbaum, Diogo Camacho 

“Technology and computation are revolutionizing the way we are creating and developing new drugs,” Cellarity CEO Fabrice Chouraqui announced to open the Second Single-Cell and AI in Medicine Symposium on May 4 in Boston, hosted by the company. Much like Cellarity itself, he suggested, the field of drug discovery is shifting its approach from viewing complex biology as something to be overcome using reductionist principles to something to embrace in a more holistic approach to cell biology. 

Although traditional approaches to drug discovery have been successful, company CSO Laurens Kruidenier continued, there are clearly clinical indications where new approaches are required. He offered the example of non-alcoholic steatohepatitis (NASH), where despite the completion of more than 50 clinical trials against 30 different targets over the last five years, no therapeutic candidates have received FDA approval. 

Rather than focus on single molecular targets, Kruidenier explained, Cellarity focuses on the cell in its entirety, arguing that the cell is a better proxy for disease than any single target and artificial intelligence stands to transform drug discovery from a serendipitous process to a design process. For him, it is time for drug discovery to become drug creation. 

“With this new approach,” he pressed, “we will access new biology and address diseases that are currently not addressable, remove a lot of the traditional constraints in the R&D process, find new chemistry, and do this faster and cheaper.” 

Finding New Biology 

The opening session of the afternoon focused on the application of single-cell technologies to reveal previously untapped biology. 

The Broad Institute’s Fei Chen suggested that much of the molecular characterization of cells—e.g., genomics, transcriptomics, epigenomics—had been determined completely out of the context of the tissues from which those cells arose. Would it be possible, he wondered, to perform that characterization within the tissue context? 

He recounted efforts to build a complete cell atlas of the mouse brain using traditional single-cell techniques on brain sections. Six million single-cell profiles identified 5,000 different cell types, of which most were neurons. In parallel, his group performed spatial transcriptomics on 101 sections of a single mouse brain, evaluating RNA expression at any X/Y coordinate. 

“The striking thing from this is that most cell types are extremely localized,” Chen noted, adding that despite 95% of the neuroscience literature focusing on the cortex, hippocampus and cerebellum, most of the cell types they identified were subcortical. (Langlieb, et al. (2023) bioRxiv. DOI: 10.1101/2023.03.06.531307) 

“There is an immense diversity of cell types that are not in the regions of the brain that we normally study,” he continued. “This is an important opportunity to figure out what these cells are doing; how do we target them; and how do we drug them?” 

Rather than perform this analysis in two separate protocols, however, Chen’s group took inspiration from the triangulation capacity of GPS to develop a technology that would allow them to both localize cells within a tissue and characterize them. (Russell, et al. (2023) bioRxiv. DOI: 10.1101/2023.04.01.535228) The slide-tags system relies on spatially arrayed beads carrying unique photocleavable DNA barcodes. Tissue sections are applied to the arrays and the barcodes are released, diffusing into nearby cells. The amount of barcode a cell captures becomes a measure of the distance between that cell and the beads around it. 

“At the end of the day, you get two libraries,” Chen explained. “One, the gene expression library that tells you the counts of the transcriptome, and the spatial barcode library. Together, they allow you to put the cell types and data back into the context of the tissue.” 
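The spatial-assignment idea behind this kind of diffusion-based barcoding can be sketched in a few lines. This is a hypothetical illustration, not the published slide-tags pipeline: the function names and the simple count-weighted-centroid model are assumptions made for clarity.

```python
# Hypothetical sketch (NOT the published slide-tags method): if each cell
# captures diffused barcodes from nearby beads, and barcode counts fall off
# with distance, a cell's position can be estimated as the count-weighted
# centroid of the coordinates of the beads that tagged it.
from collections import Counter

def infer_position(barcode_counts, bead_coords):
    """barcode_counts: Counter of bead barcode -> UMI count in one cell.
    bead_coords: dict of bead barcode -> (x, y) position on the array."""
    total = sum(barcode_counts.values())
    if total == 0:
        return None  # no spatial information captured for this cell
    x = sum(n * bead_coords[b][0] for b, n in barcode_counts.items()) / total
    y = sum(n * bead_coords[b][1] for b, n in barcode_counts.items()) / total
    return (x, y)

# Toy array of three beads and one cell's captured barcode counts.
beads = {"AAAC": (0.0, 0.0), "GGTT": (10.0, 0.0), "CCAG": (0.0, 10.0)}
cell = Counter({"AAAC": 60, "GGTT": 30, "CCAG": 10})
print(infer_position(cell, beads))  # -> (3.0, 1.0)
```

The cell lands closest to the bead contributing most of its counts, which is the intuition behind using diffusion as a distance measure.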

The method, he insists, is also applicable to other modalities such as chromatin accessibility, TCR sequencing, and copy number measurement, as well as other single-cell and multi-omic measurements. 

Samantha Morris of Washington University in St. Louis, meanwhile, wanted to apply single-cell analysis and machine learning to understand more about the factors influencing cell identity with a goal of generating clinically valuable cell types. In 2014, her group created CellNet, a network-biology-based platform designed to compare the identities of engineered cell types against their in vivo correlates. They found their engineering efforts lacking. 

“When we try to take one fully differentiated cell type, over-express transcription factors and push it toward another fully differentiated cell type, we still see the original cell identity there,” she suggested. “We go completely off-target. We seem to go back into developmental origins for these cells, and we are trying to avoid that with these protocols. Fundamentally, this limits the practical utility of these engineered cells.” 

Where CellNet relied on bulk expression data, the group has since developed the Capybara platform, performing the same analysis to measure cell identity and cell fate transitions at single-cell resolution. (Kong et al. (2022) Cell Stem Cell. 29, 1-15) 

“We tend to get this smear of a change in cell identity,” Morris said. “It’s non-physiological. We don’t get discrete cell states from these protocols, so we consider that each single cell represents an identity on a continuum.” 

“It’s made us take a step back and think about the gene regulatory logic of reprogramming, and we’re still a long way away from this. We want to understand how cell identity is regulated, how transcription factors are controlling identity.” 

To do that, Morris’ group developed CellOracle, a machine-learning platform that leverages gene regulatory network (GRN) models to simulate cell perturbation in response to transcription factors. (Kamimoto et al. (2023) Nature. 614, 742-751) The group systematically performed knockout simulations for every transcription factor in hematopoiesis and found that 85% of the CellOracle predictions were supported by published phenotypes. But could CellOracle discover new biology? 

The group turned to zebrafish development and was able to experimentally confirm predicted phenotypes for about 230 transcription factors, including new molecules they can add to their base cocktails to improve cell differentiation. Morris acknowledges, however, that CellOracle is currently limited to cell types or cell states it has already seen. Its models are also linear, so the system can only predict outcomes from knockout or overexpression of one transcription factor at a time. 
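The kind of linear GRN simulation Morris described can be illustrated with a toy sketch. Everything here — the coefficient matrix, gene indices, and iteration scheme — is illustrative, not CellOracle's actual implementation, which infers signed regulatory coefficients from single-cell data:

```python
# Hypothetical sketch of linear-GRN perturbation simulation in the spirit
# of CellOracle. W[i, j] encodes the signed, inferred effect of gene j on
# gene i; clamping one TF to zero and repeatedly propagating the shift
# through W estimates the downstream transcriptional response.
import numpy as np

def simulate_knockout(W, expression, tf_index, n_steps=3):
    """Return the estimated expression shift after knocking out one TF."""
    delta = np.zeros_like(expression)
    delta[tf_index] = -expression[tf_index]  # force the TF to zero
    total = delta.copy()
    for _ in range(n_steps):
        delta = W @ delta           # propagate one regulatory "hop"
        delta[tf_index] = 0.0       # keep the knockout clamped
        total += delta
    return total

# Toy 3-gene network: gene 0 (the TF) activates gene 1; gene 1 represses gene 2.
W = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])
expr = np.array([1.0, 1.0, 1.0])
shift = simulate_knockout(W, expr, tf_index=0)
print(shift)  # knocking out gene 0 lowers gene 1 and, indirectly, raises gene 2
```

The one-TF-at-a-time limitation mentioned above falls naturally out of this formulation: each simulation clamps a single index, and effects combine only linearly.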

To improve the process experimentally and computationally, she co-founded the company Capybio. 

The University of Pennsylvania’s Sydney Shaffer used cellular barcoding to monitor the progression of disease at the single-cell level, focusing her attention on the transition of healthy esophageal tissue to esophageal cancer through an intermediate step called Barrett’s esophagus. Rather than add the barcodes before characterization, however, her group wanted barcodes integrated into the genome so that when a cell divides, the barcode is passed on. This would allow them to connect a cell from a given timepoint to a clinical outcome, such as drug resistance. Where researchers like Fei Chen used exogenous barcodes, Shaffer’s group relied on endogenous markers: mutations within the mitochondrial genome, intrinsic to patients long before the onset of disease. 

“Thus, you can do scRNA-seq and mitochondrial DNA sequencing from the same samples, giving the cell transcriptional states as well as the progenitor relationships between cells,” she explained. 

Using single-cell lineage tracing and transcriptional profiling, Shaffer’s group showed that Barrett’s metaplasia was polyclonal, arising from multiple progenitor and differentiated cell types, whereas precancerous dysplastic tissue arose from expansion of a single Barrett’s esophagus clone. (Gier, et al. (2023) bioRxiv. DOI: 10.1101/2023.01.26.525564) The expression data also pointed to involvement of Wnt pathways in dysplasia, with whole-exome sequencing identifying a mutation in APC, a member of that pathway. 

Seeking Master Regulators 

The second session of the day focused largely on efforts to leverage GRNs both to understand normal cell physiology and to improve treatment of disease. 

Setting the bar quickly, Columbia University’s Andrea Califano argued that because there are more mutational patterns in cancer and other complex diseases than atoms in the universe, going after disease one mutation at a time made little sense. 

“We [therefore] decided to focus on the proteins that are responsible for integrating the effect of the mutation,” he said. “Then, we can establish and maintain homeostatically the cellular state, either pathologic or physiologic.” 

These master regulators don’t work in isolation but rather in modules that are tightly auto-regulated. Thus, rather than target the mutated genes or proteins, he argued for targeting the modules. He highlighted the challenge through an analysis of pancreatic cancer. Using VIPER, a platform that infers protein activity within a regulome based on gene-expression data, researchers showed that the cells occupied six different states with distinct protein expression profiles, which Califano suggested explained the challenges of treating pancreatic cancer. (Alvarez, et al. (2016) Nat Genet. 48, 838-847) “You don’t have to take care of one population, you have to take care of six,” he explained. 

Califano then described the extension of these efforts, incorporating drug perturbation analysis to create a platform called OncoTreat. (Alvarez et al. (2018) Nat Genet. 50, 979-989; Mundi et al. (2023) Canc Discov. DOI: 10.1158/2159-8290.CD-22-1020) 

“We now generate about 40,000 profiles at Columbia University and maybe 200,000 profiles at Darwin Health of drug treatments in cells that recapitulate the master regulators of clinically relevant tumor states,” he noted. “We don’t use cell lines as surrogates for the response to a drug. We use them simply to elucidate the mechanism of action of the drug.” 

Before and after treating cells with a drug, the researchers measure the activity of the master regulators. Their goal is to find a compound that reverses the activities of the master regulators that maintain the cells’ tumor state. 
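One way to make this "reversal" idea concrete is to score candidate drugs by how strongly their induced shift in master-regulator activity anti-correlates with the signature that maintains the tumor state. This is a hypothetical sketch, not the OncoTreat algorithm; the function name and all numbers are illustrative.

```python
# Hypothetical sketch of reversal scoring: a drug whose induced change in
# master-regulator (MR) activity is anti-correlated with the tumor-state
# MR signature is predicted to push cells away from that state.
import numpy as np

def reversal_score(tumor_mr_activity, drug_induced_shift):
    """Negative Pearson correlation, so higher score = stronger reversal."""
    t = tumor_mr_activity - tumor_mr_activity.mean()
    d = drug_induced_shift - drug_induced_shift.mean()
    return float(-(t @ d) / (np.linalg.norm(t) * np.linalg.norm(d)))

# Toy signature of four MRs (positive = active in the tumor state) and two
# illustrative drug-response profiles measured before vs. after treatment.
tumor  = np.array([ 2.0,  1.5, -1.0, -2.5])
drug_a = np.array([-1.8, -1.2,  0.9,  2.1])   # pushes every MR the other way
drug_b = np.array([ 1.0,  0.8, -0.5, -1.1])   # reinforces the tumor state
print(reversal_score(tumor, drug_a) > reversal_score(tumor, drug_b))  # True
```

Under this toy scoring, drug_a would be ranked as the reversal candidate while drug_b would be deprioritized.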

Already, he said, they have seen success with this approach as shown in the clinical trial of an HDAC6 inhibitor and nab-paclitaxel in metastatic breast cancer (NCT02632071), where every patient who received the inhibitor responded in a clinically relevant manner to treatment. 

Califano then asked: Can we do the same thing in single cells, and can we do it not only in the transformed compartment but also in the tumor microenvironment (TME) compartment because both contribute to the pro-malignancy state of the tumor? 

Trying to target breast cancer cells with stem-like properties, they performed OncoTreat predictions at the single-cell level and tested drugs in PDX tumor models. They found that whereas paclitaxel depleted cells that were more differentiated, leaving stem-like cells unaffected, albendazole had the opposite effect. 

“We thought, why not combine the two drugs,” he recounted. “Basically, do multiple cycles of albendazole and paclitaxel, and hopefully you’ll prevent the tumor from coming back and regenerating. You’re also killing the cells that would otherwise kill the mouse.” 

In four of six models receiving both drugs, tumors became cytostatic or showed slowed growth, whereas all of the models on monotherapy saw exponential tumor growth. 

Analyses with VIPER also highlighted significant differences in master regulator expression between tumor-infiltrating regulatory T cells (Tregs) versus those in circulation (Obradovic et al. (2023) Canc Cell. 41, 933-949). OncoTreat was used to predict what drugs would influence these profiles, and they found that gemcitabine was able to reverse Treg phenotype at concentrations 10-fold lower than what is used in the clinic. They also saw a dramatic response in mouse tumor models. Such studies can be vital, Califano explained, to preventing tumor cells from escaping treatment. 

“Instead of targeting one population using synergistic drugs in that population, we’re going to implement synergy by targeting co-existing but molecularly distinct subpopulations within the same tumor,” he pressed. “That doesn’t need to be done by co-administering the drugs, which could lead to significant toxicities. You can actually use one drug, then another drug, then another drug, then take a holiday, and then start the cycle again.” 

Califano also introduced OncoLoop, which profiles the master regulators of a patient’s cancer to facilitate selection of genetically engineered mouse models (GEMM) that better recapitulate the tumor. (Vasciaveo et al. (2023) Canc Discov. 13, 386-409) 

“OncoLoop showed that when you target master regulators, you abrogate the ability of the tumor to create its own microenvironment that is immunosuppressive,” he explained. “So now, tumors like prostate cancer, which are intrinsically very resistant to immunotherapy, become strikingly sensitive. So, there’s complete abrogation in GEMM models of all metastases, both in the bone and in the liver.” 

Zev Gartner from the University of California, San Francisco used similar approaches to understand the dynamic and self-organizing capacities of tissues, from both a homeostatic and a diseased perspective. 

“For at least 100 years, if not longer, pathologists have been looking at the structure of tissues to diagnose diseases,” he started. “So, I think it stands to reason that a change in tissue structure is somehow intricately linked to disease pathology. Unfortunately, we don’t know how it works. Why is tissue structure changing, and what regulates that?” 

To perform single-cell analysis and determine the underlying GRNs, Gartner’s group developed a barcoding strategy called MULTI-seq, which relies on cell-surface modification with lipid-modified oligonucleotides (LMOs). These LMOs consist of two hybridizing oligos modified with fatty acids that make them stick to the cell surface. A third oligo carrying the barcode is then introduced, hybridizing to the cell-anchored oligo pair. 

One of the challenges they encountered, however, was that emulsion droplets not only contained directly barcoded cells but were also sometimes contaminated with off-target barcodes that adhered to cells or with free-floating barcodes. To help deconvolute these samples, Gartner and his team trained an algorithm, deMULTIplex2, to determine whether a specific combination of barcodes was on- or off-target. (Zhu, Conrad, Gartner (2023) bioRxiv. DOI: 10.1101/2023.04.11.536275) In some cases, he noted, they improved the ability to classify cells into the right sample by 60-fold.  
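A much-simplified sketch of the underlying demultiplexing problem may help: each droplet carries counts of sample barcodes, and a sample call should be made only when one barcode clearly dominates. deMULTIplex2 itself fits a statistical model of barcode contamination; the function and thresholds below are purely illustrative assumptions.

```python
# Hypothetical, much-simplified version of the sample-demultiplexing task
# (NOT the deMULTIplex2 algorithm): call a droplet's sample of origin only
# if its top barcode dominates the counts; otherwise flag it as ambiguous
# (possible doublet/contamination) or as having too little signal.
def classify_droplet(counts, min_fraction=0.7, min_total=20):
    """counts: dict of barcode -> UMI count for one droplet.
    Returns the called barcode, 'ambiguous', or 'low_signal'.
    Thresholds are illustrative, not taken from the paper."""
    total = sum(counts.values())
    if total < min_total:
        return "low_signal"
    top_bc, top_n = max(counts.items(), key=lambda kv: kv[1])
    return top_bc if top_n / total >= min_fraction else "ambiguous"

print(classify_droplet({"BC1": 90, "BC2": 5, "BC3": 5}))    # -> BC1
print(classify_droplet({"BC1": 40, "BC2": 45, "BC3": 15}))  # -> ambiguous
print(classify_droplet({"BC1": 3}))                         # -> low_signal
```

The hard cases Gartner described — off-target barcodes stuck to cells and free-floating barcodes in droplets — are exactly the ones a fixed threshold handles poorly, which is why a learned model pays off.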

They turned their approach to the challenge of understanding villus morphogenesis in embryonic mouse development, identifying a set of networks that, when upregulated, made the cells more cohesive, as well as other networks that allowed cells to behave more fluidly, able to move around and form clusters. They also examined the role of hormones in regulating the dynamics of breast tissue, particularly across the menstrual cycle, which closely correlates with the risk of breast cancer. (Murrow et al. (2022) Cell Syst. 13, 644-664) Using tissue samples from breast reduction surgery patients, they were able to reconstruct cell state changes happening across the hormonal cycle. 

Gartner’s group is now trying to adapt these strategies to reveal mechanisms to modulate different tissue systems, pointing to immune system regulation as a particularly attractive target. (Jiang, et al. (2023) bioRxiv. DOI: 10.1101/2023.04.19.537364) 

Harvard University’s Marinka Zitnik took a step back from experimentation to examine the larger problem faced by the community, starting with the staggering reality that of the estimated 10⁶⁰ compounds that might have drug-like properties, only a tiny fraction is being investigated and an even smaller fraction (on the order of 10⁵) has received FDA approval. 

“Is there a way that we can leverage modern data resources, automation, and AI to help expand that space?” she asked. What we need, she continued, are algorithms and methods that not only accurately predict within the previously observed space but also can extrapolate beyond that to identify novel designs, structures, and biological contexts. 

The Therapeutic Data Commons is a place where this can begin, she suggested. (Huang, et al. (2022) Nat Chem Biol. 18, 1033-1036) She described it as a meeting point for biomedical and biochemical scientists, on the one hand, who can identify bottlenecks throughout the drug discovery and development pipeline, as well as across therapeutic modalities, and machine learning and computational scientists, on the other, who can design systems to help resolve these bottlenecks. 

Zitnik pointed to TxGNN, a very large graph neural network that facilitates therapeutic-use prediction across a large array of diseases and therapeutic candidates, leveraging knowledge about genes, phenotypes, diseases, and molecular compounds, as well as existing drugs and their mechanisms of action. Once trained, she said, the system can offer insights about indications and contraindications across the entire set of diseases. (Huang, et al. (2023) medRxiv. DOI: 10.1101/2023.03.19.23287458) 

One challenge, she suggested, was that although current systems can make very accurate predictions, those predictions are often clinically trivial as researchers could have made them without the models. It’s relatively easy to predict efficacy of a drug in diseases with many treatment options, she explained, but much harder when there are no treatments available. 

To test TxGNN, Zitnik’s group compared its predictions to electronic health records (EHRs) of off-label prescribing in a large healthcare system, believing that indication or contraindication predictions should align with clinical experience. They found that TxGNN more closely matched clinical experience than other models for indications (49.2% higher) and contraindications (35.1% higher). It also predicted therapeutic uses that aligned with recent FDA approvals. 

Another challenge Zitnik’s group is trying to address is that predictions are of limited use without some sense of the context in which that prediction is true and actionable. Such information, she argued, would help researchers understand what downstream experiments would be required to test those predictions. To that end, she described AWARE, a deep-learning approach that contextualizes molecular cell atlas data with protein interaction networks. (Li & Zitnik (2021) arXiv. DOI: 10.48550/arXiv.2106.02246) The algorithm currently connects almost 400,000 proteins across 156 cell types within 24 tissues and organs. When combined with drug perturbation data, she suggested, the system allows researchers to identify protein targets and cellular contexts for further investigation.   

In a theme that was picked up later in the panel discussion, however, Zitnik also suggested that it was not enough to simply develop algorithms. It was also critical to design user interfaces that allowed users to freely ask their questions without requiring expertise in coding or computational science. Toward this end, she offered TxPLM, a language model trained on the English language, allowing experts to type their queries in plain text and receive amino acid sequences or protein domains that might be functional for the phenotype described. 

Panel Discussion 

Although several of the final panelists spoke of the importance of the technological and computational innovations exemplified by the earlier talks for transitioning drug discovery to drug creation or design, to a person, they acknowledged the vital importance of cross-disciplinary collaboration to making this dream a reality. 

“Finding new biology is slowly being taken out of the exclusive realm of biologists, which is a good thing,” Cellarity’s Laurens Kruidenier offered. “I often felt when I was doing target discovery in Big Pharma, we were often very biased by what we knew or thought that we knew, which is of course very little. Now, when I let loose a bunch of computational folks on the same datasets, we get completely different insights, right or wrong. To me, it’s very important that we take the bias out, and I think computational methods can help us do that, basically expanding our brain in that way.” 

Key to the success of such collaborations, added Jonah Cool of the Chan Zuckerberg Initiative, was having a clear and shared goal to which the group could anchor itself and relate. 

“Maybe people think about it and describe it in a different language, but ultimately, everyone at the end of the day can understand it,” he suggested, adding “The other thing that is really important is enough flexibility where the group can bob and weave toward it. There is a balance there.” 

Such cross-disciplinary integration efforts can become more challenging, however, as organizations and companies scale in size, several speakers acknowledged. 

“Within a smaller company or unit within a company, it can be easier to integrate,” Pfizer’s Enoch Huang noted. “As you get larger, you can’t all be in the same space. And with scale, comes specialization. You run the risk of losing integration.” 

As suggested in a couple of the earlier talks, data also continues to be a bottleneck for machine learning analysis. For example, Genentech’s Tommaso Biancalani suggested that his organization liked foundational machine learning models but found that, with much smaller datasets than are available in other technological areas, the models were more limited in drug discovery and development. 

Complementing Zitnik’s efforts with EHRs, Kruidenier lamented the current state of EHR metadata curation and standardization—a sentiment echoed by Huang—particularly when applied to questions like patient response to drug treatment. 

“It’s not a simple yes or no,” Kruidenier said, explaining the confounding factors. “How long was the patient on treatment? Did they have the time to respond? Was the drug given at the right dose? So many variables. There is so little standardization of publicly available metadata.” 

Thus, the problem can be less about data availability than about quality or usability. 

“The largest investment in time aside from developing systems is the funding and resources required to curate data,” Cool suggested. “Sure, you could make the data available, but the time and the resources required to curate it, describe it, make it useful is just not worth the resources and so you don’t.” 

Further, there is always a question of what data to release publicly, continued Biancalani, as company researchers need to find a balance between being a collaborative member of a community and ensuring their companies stay in business. 

“All the software should be open-source, all the machine learning algorithms should be released in papers, but the data are a bit trickier,” he stated, relating that he has participated in conversations within Genentech that decided what data needed to be kept private and what could be released without jeopardizing intellectual property. “Sharing models is good. Sharing data, some.” 

Living In The Future Now 

The keynote presentation by Harvard Business School’s Karim Lakhani gave the audience a taste of things to come and something of a challenge to rethink the obstacles that remain to replacing serendipitous discovery with drug creation. Lakhani suggested that pharma and biotech are on the cusp of completely transforming their business and operating models, and he pointed to examples from other industries. 

“I hear the issues about data,” he said, “but I would say you have to apply a laser focus on asking, across our discovery operations, our manufacturing operations, and also, importantly, on the patient side, how do we bring data across the way?” 

Despite the conversations he heard earlier in the day, he said there was still too much data being siloed and not enough collaboration. 

“We want an operating model where data cuts across the entire enterprise,” he pressed. “We don’t want an oncology silo and a neurology silo and a respiratory silo. And we want to delay customization to the very last minute.” 

For many organizations, he added, AI is mostly a craft-based operation that he said resembled the couture houses of Italy and France. The algorithms are hand-built, customized for one purpose. They’re one-offs. They’re not scalable, he challenged, and the best that AI is doing is building a dashboard. 

AI needs to be industrialized, he argued, if its true power is to be realized. Data and AI tools need to be shared across every operation, creating an AI factory. 

Much as many of the speakers suggested biomedical, biochemical, machine learning and computational specialists could learn a lot from each other, Lakhani widened that call, perhaps setting the table for the 2024 Single-Cell and AI in Medicine Symposium.