Generate Biomedicines’ R&D Data Platform to Support ML-Driven Drug Discovery

By Allison Proffitt

August 8, 2023 | “Generate Biomedicines’ goal of leveraging machine-learning powered generative biology to discover and develop new drugs faster and cheaper is supported by an integrated and cohesive R&D data platform… FAIR to its core, this integrated data platform leverages industry best practices and lessons-learned from a talented team of informatics engineers to form the critical foundation of a unique company with equal parts ‘bio’ and ‘tech’.”

That’s how the nominating team described the entry from Generate Biomedicines in the 2023 Bio-IT World Innovative Practices Awards program, and our judges agreed. Generate Biomedicines was named one of five winning projects for 2023.

As a company launched in 2020 by Flagship Pioneering, Generate has the distinct advantage of not having legacy systems or data to contend with—an acknowledgement Stephen Kottmann, senior director, Informatics, Engineering and IT, freely made in his Innovative Practices presentation. But since the company’s launch, the system has supported 50+ projects; 55,000 proteins; 11,000 platemaps; 70 production assays; 1,800 assay runs; and 1.8 million results. A guiding principle is to pick a few things and be great at them, Kottmann advised, quoting Dimitris Agrafiotis, the company’s Chief Digital Officer. And even with a “born digital” advantage, the principles of Generate’s architecture and the other tools they’ve brought in can still be emulated by both new and more established companies, Kottmann—and the judges—believe.

Generate’s flywheel is the “Generate-Build-Measure-Learn” cycle, explained Kottmann. At Generate, proprietary and public data are combined to build generative models that design novel proteins with desired form and function. Seamless, digital hand-off of these design variants is key to the company’s platform, he said. Manual hand-off and data manipulation are expensive, inefficient, and error-prone. Design variants pass to protein engineering; synthesized proteins are characterized via project- and workflow-specific assay cascades; and samples are tracked through each step of the process. Data from biophysical characterization, developability, and functional assays are captured to be analysis- and ML-ready, as well as centralized, organized, and managed to be findable and accessible to humans and machines alike.

The core Generate data platform is anchored on Benchling’s ELN, Bioregistry, and Inventory solutions, Kottmann explained. Those tools are augmented and amplified with Generate’s highly flexible, extensible, and event-based data modeling service, with bi-directional synchronization between the two core solutions. This dynamic data modeling capability has supported the rapid and agile evolution of a data model that currently has approximately 250 entity types with 2,500 properties, allowing the company to comprehensively register and track lab activities. Role- and team-based access controls (also mapped and synced between systems) limit which entities, properties, and records can be accessed and updated in either system. All access to data through downstream systems is funneled through this core digital backbone to ensure consistent permissions regardless of access patterns.

For raw data generated in the laboratory, Generate has taken a “right-sized” approach, Kottmann explained. “There are super complicated things you can try to build and do. In our case, we solved this with a really simple little data agent, literally about 100 lines of Java code.” That data agent is installed on all lab equipment, where it monitors each system for new data and copies it to an AWS S3 data lake. Each raw data file is catalogued in the core data management system and “decorated” with all applicable metadata to create an integrated, browsable interface to the data lake.
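
Generate did not share the agent’s code, and the original is written in Java. As a rough illustration of the watch-and-copy idea only, the Python sketch below scans a directory, hands each previously unseen file to an uploader, and remembers what it has already sent. Every name in it (scan_once, make_local_uploader, demo) is hypothetical, and a local folder stands in for the S3 data lake so the example runs without cloud credentials.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Set, Tuple


def scan_once(watch_dir: Path, seen: Set[Path],
              upload: Callable[[Path], None]) -> List[Path]:
    """Upload every file under watch_dir that has not been seen before."""
    new_files = []
    for path in sorted(watch_dir.rglob("*")):
        if path.is_file() and path not in seen:
            upload(path)
            seen.add(path)
            new_files.append(path)
    return new_files


def make_local_uploader(watch_dir: Path, lake_dir: Path) -> Callable[[Path], None]:
    """Assumption: a local directory stands in for the S3 data lake."""
    def upload(path: Path) -> None:
        # Preserve the instrument's folder layout inside the "lake".
        target = lake_dir / path.relative_to(watch_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return upload


def demo() -> Tuple[int, int, List[str]]:
    """Write one fake instrument file, scan twice, and report what was copied."""
    with tempfile.TemporaryDirectory() as watch, tempfile.TemporaryDirectory() as lake:
        watch_dir, lake_dir = Path(watch), Path(lake)
        seen: Set[Path] = set()
        upload = make_local_uploader(watch_dir, lake_dir)

        (watch_dir / "run1").mkdir()
        (watch_dir / "run1" / "plate.csv").write_text("A1,0.5\n")

        first = scan_once(watch_dir, seen, upload)   # picks up the new file
        second = scan_once(watch_dir, seen, upload)  # nothing new this time
        copied = sorted(p.relative_to(lake_dir).as_posix()
                        for p in lake_dir.rglob("*") if p.is_file())
        return len(first), len(second), copied
```

A production agent would call something like scan_once on a timer, would need to wait until an instrument has finished writing a file before copying it, and would swap the local uploader for a real object-store client. None of those details are described in the presentation.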

To plan experiments, Generate developed an internal, web-based graphical user interface representing plate layouts. More detailed and more flexible than an inventory system, the tool allows users to annotate plates with both standard and custom annotations. Registered platemaps are integrated into a variety of workflows and can be used to interact with a wide variety of automation equipment by directly generating worklists to drive liquid handlers or sample lists for instruments.
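
To make the platemap-to-worklist step concrete, here is a minimal sketch that turns an annotated 96-well platemap into a CSV worklist. The Well fields, the column names (source_tube, dest_well, volume_ul), and the sample-to-tube lookup are all assumptions made for illustration. Generate’s actual platemap model and the worklist formats its liquid handlers expect are not described in the entry.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Well:
    sample_id: str
    volume_ul: float
    annotations: Dict[str, str] = field(default_factory=dict)  # custom labels


def well_names(rows: int = 8, cols: int = 12) -> List[str]:
    """Well IDs in row-major order: A1..A12, B1..B12, and so on."""
    return [f"{chr(ord('A') + r)}{c}" for r in range(rows) for c in range(1, cols + 1)]


def to_worklist(platemap: Dict[str, Well], source_lookup: Dict[str, str]) -> str:
    """Render a platemap as CSV text, one transfer per occupied well."""
    valid = well_names()
    unknown = set(platemap) - set(valid)
    if unknown:
        raise ValueError(f"not on a 96-well plate: {sorted(unknown)}")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source_tube", "dest_well", "volume_ul"])
    for name in valid:  # emit transfers in plate order, not dict order
        if name in platemap:
            well = platemap[name]
            writer.writerow([source_lookup[well.sample_id], name, well.volume_ul])
    return out.getvalue()
```

For example, a platemap with sample P-1 in A1 and sample P-2 in B2 yields a header row followed by two transfer rows, ordered by well position.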

Generate’s ML models need high-quality, robust, repeatable, contextualized data for training and developing the next iteration of models, the company explained in their entry. “To this end, our assay data capture, QC, and publication tool ties everything together to register FAIR data from each of our production assays. This custom solution uses a toolkit-based architecture where individual component methods for assay data processing (parsing, normalizing, curve-fitting, etc.) are implemented as reusable service endpoints, which can be combined via configuration to form repeatable and consistent protocols for assay workup.”
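
The toolkit idea, small processing steps combined by configuration, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Generate’s implementation: the entry describes the steps as reusable service endpoints, whereas plain functions stand in for them here, and the step names, parameters, and PROTOCOL layout are invented for the example.

```python
from typing import Any, Callable, Dict, List

# Registry of reusable steps, keyed by the name used in protocol configs.
STEPS: Dict[str, Callable[..., Any]] = {}


def step(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a function as a named processing step."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        STEPS[name] = fn
        return fn
    return register


@step("parse")
def parse(raw: str) -> List[float]:
    """Parse comma-separated readings into floats."""
    return [float(tok) for tok in raw.split(",") if tok.strip()]


@step("subtract_blank")
def subtract_blank(values: List[float], blank: float) -> List[float]:
    """Remove the background signal measured in a blank well."""
    return [v - blank for v in values]


@step("normalize")
def normalize(values: List[float], control: float) -> List[float]:
    """Express each reading as a fraction of the control signal."""
    return [v / control for v in values]


def run_protocol(config: List[Dict[str, Any]], data: Any) -> Any:
    """Apply the configured steps in order, feeding each output to the next."""
    for entry in config:
        fn = STEPS[entry["step"]]
        data = fn(data, **entry.get("params", {}))
    return data


# A protocol is only configuration: which steps, in what order, with what parameters.
PROTOCOL = [
    {"step": "parse"},
    {"step": "subtract_blank", "params": {"blank": 1.0}},
    {"step": "normalize", "params": {"control": 4.0}},
]
```

Calling run_protocol(PROTOCOL, "1.0, 3.0, 5.0") returns [0.0, 0.5, 1.0]. Onboarding a new assay in this style means writing a new configuration, and new code only when a step is missing, which is consistent with the few-day onboarding time the company reports.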

New production assays are onboarded via a special operating committee at Generate: the Assay Data Committee. This team has representatives from across the organization and helps guide new assay owners through a process that ensures each assay has appropriate controls and conditions, as well as a plan to track and assess assay data quality over time. “This standardized process, along with the toolkit architecture of the assay data capture tool allows us to onboard new production assays within a few days,” the company writes.

Data Exploration

When it comes to exploring data, scientists want programmatic access to their data, Kottmann said. Generate built three levels of access to meet users where they are: Pipelines, Workspaces, and Flows. Pipelines are predefined, containerized, versioned computational workflows that are launched via a web interface and run on EKS infrastructure. Workspaces use JupyterHub for cloud resource allocation and have custom launch templates for ML workflows for ad-hoc computational science.

Finally, Flows uses Prefect as an orchestration layer on top of EKS. “Sort of the cool novelty here,” Kottmann added, “is an intermediate layer called Generate-HPC that can dynamically allocate to a heterogeneous cluster.” Generate ML scientists code in Python, he explained, and they don’t need to know how EKS or AWS work. They simply annotate functions with their compute needs and the system “will dynamically allocate worker pods onto EKS, run that function, and then return back into the centralized manager of Prefect.” In addition, Generate uses Prometheus, Grafana, and Kubecost for reporting and cost tracking. For each tool, everything is tied together via a request fulfillment system that builds the whole workflow as a directed acyclic graph (DAG) of dependent tasks, tracking all details and notifying each team member on Slack of next steps.
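
Two ideas in this passage lend themselves to a short sketch: annotating a function with its compute needs, and ordering dependent tasks as a DAG. The Python below is a toy illustration only. It uses neither the Prefect API nor Generate-HPC, whose interface was not shown. The compute decorator, the POOLS table, and the task names are hypothetical, and the "allocation" is a simple lookup, not a Kubernetes scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set, Tuple


def compute(cpus: int = 1, gpus: int = 0, memory_gb: int = 4) -> Callable:
    """Hypothetical annotation: attach resource needs to a function."""
    def annotate(fn: Callable) -> Callable:
        fn.resources = {"cpus": cpus, "gpus": gpus, "memory_gb": memory_gb}
        return fn
    return annotate


# Assumed node pools of a heterogeneous cluster, smallest first.
POOLS: Dict[str, Dict[str, int]] = {
    "cpu-small": {"cpus": 4, "gpus": 0, "memory_gb": 16},
    "gpu-large": {"cpus": 16, "gpus": 1, "memory_gb": 64},
}


def pick_pool(fn: Callable) -> str:
    """Return the first pool whose capacity covers the function's needs."""
    need = fn.resources
    for name, capacity in POOLS.items():
        if all(capacity[key] >= need[key] for key in need):
            return name
    raise ValueError(f"no pool can run {fn.__name__}: {need}")


@compute(cpus=2)
def featurize() -> str:
    return "features"


@compute(cpus=8, gpus=1, memory_gb=32)
def train() -> str:
    return "model"


@compute()
def report() -> str:
    return "summary"


def plan(tasks: Dict[str, Callable], deps: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Order tasks so dependencies come first, and pair each with a pool."""
    order = TopologicalSorter(deps).static_order()
    return [(name, pick_pool(tasks[name])) for name in order]


TASKS = {"featurize": featurize, "train": train, "report": report}
DEPS = {"train": {"featurize"}, "report": {"train"}}  # task -> its prerequisites
```

Here plan(TASKS, DEPS) returns [("featurize", "cpu-small"), ("train", "gpu-large"), ("report", "cpu-small")]. The point of the design, as Kottmann describes it, is that the scientist writes only the annotation, while placement and ordering are handled by the platform.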

Rinse, Repeat

While Generate has built quite a tech stack, Kottmann still advocates for a narrow view. “It’s very easy to get distracted by the latest and greatest technologies,” he said. Stay focused, he recommends.

“Invest in generalization and templating. A lot of us have been presented with these times where you’re asked to do something: I could either pound this out in Excel, or [I] could write a script. You make this mental calculus: it’s going to take me longer to write the script, but I might have to do this again... I have been surprised by how many examples there are of things that can be scripted and templated and generalized and used over and over again.” Take the time to automate, he advises. And when you do, implement Continuous Integration and Continuous Delivery (CI/CD) every step of the way, he adds. “We don’t deploy anything in our organization that does not go through a CI/CD pipeline.”