Platform Reduces Barriers Biologists Face In Accessing Machine Learning

August 1, 2023

By Deborah Borfitz 

A group of scientists at the Wyss Institute for Biologically Inspired Engineering at Harvard University and MIT are convinced that automated machine learning (autoML) is going to revolutionize biology by removing many of the technical barriers to using computational models to answer fundamental questions about sequences of nucleic acids, peptides, and glycans. “Machine learning can be complicated, but it doesn’t have to be, and sometimes simpler is better,” according to graduate student Jackie Valeri, a big believer in the power of autoML to solve real-world problems.

AutoML is a machine learning concept that helps users transfer data to training algorithms and automatically search for the best ML architecture for a given problem, lowering the demand for expert-level computational knowledge that currently outpaces the supply. It can also be “pretty competitive” with even the best manually designed ML models, which can take months if not years to develop, says Valeri, as she and her colleagues recently demonstrated in a paper published in Cell Systems (DOI: 10.1016/j.cels.2023.05.007).

The article showcased the potential of their novel BioAutoMATED platform which, unlike other autoML tools, accommodates more than one type of ML model and is designed to accept biological sequences. Its intended users are systems and synthetic biologists with little or no ML experience, says Valeri, who works in the lab of Jim Collins, Ph.D. at the Wyss Institute. 

The all-in-one BioAutoMATED platform modifies three existing AutoML tools—AutoKeras, which searches for optimal neural networks; DeepSwarm, which looks for convolutional neural networks; and TPOT, which hunts for a variety of other, simpler modeling techniques such as linear regression and random forest classifiers—to come up with the most appropriate model for a user’s dataset, she explains. Standardized output results are presented as a set of folders, each associated with one of those search techniques, revealing the best performing model in graphic and text file format. 

The tool is “very meta,” says Valeri, in that it is “learning on the learning.” Model selection is often the part of a research project that requires computational expertise biologists generally do not possess. The task can’t easily be handed to an ML specialist, even if one can be found, because domain knowledge is needed in the model-building process.

Overall, biological researchers are excited about using machine learning but until now have been “stymied by the amount of coding needed” to get started, she says, noting that it is not uncommon for ML models to have a codebase of over 750 lines. “The installation of packages alone can be a huge barrier.” 

Interest in ML has skyrocketed over the past year thanks largely to the introduction of ChatGPT with its user-friendly interface, but people have also quickly discovered they can’t trust everything the large language model has to offer, says Valeri. Similarly, BioAutoMATED is useful but not a “magic bullet” that erases data problems and, like ML in general, should be approached with a healthy amount of skepticism to ensure it is learning what’s intended.

BioAutoMATED will in the future likely be used together with ChatGPT, predicts Wyss postdoctoral fellow Luis Soenksen, Ph.D., co-lead author on the Cell Systems paper. Researchers will simply articulate what they want to do and be presented with the best questions, required data, and ML models to get the job done.  

Features and Outputs  

When put to the test, BioAutoMATED not only outperformed other autoML tools but also some of the models created by a professional ML expert—and did it in under 30 minutes using only 10 lines of input code from the user. The required coding is for the basics, says Valeri, to specify the target folder for results, the file name where input data can be found, the column name where sequences can be found within that file, and run times for these extensions. 

Users are instructed to first install Docker on their computer, if they have not done so already, and are walked through the process of doing that, she adds. The open software platform sets up its own environment for running applications, requiring only two lines of code to access the Jupyter notebooks preloaded on BioAutoMATED that contain everything needed to run the autoML tool. It’s a “quick start” for most people accustomed to using a computer. 

With a bit more coding, users can access some of the embedded extras, says Valeri. These include the outputs from scrambled control tests, in which BioAutoMATED generates sequences by shuffling the order of nucleotides, answering the frequently asked question of whether models are picking up on real order- and sequence-specific biology.
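The idea behind a scrambled control can be sketched in a few lines of Python. This is an illustration only, not BioAutoMATED's own code: the function names and the example sequences are hypothetical. Each sequence is shuffled so that its nucleotide composition is preserved while the order is destroyed. If a model trained on the shuffled sequences performs about as well as one trained on the originals, it is probably not learning order-specific biology.

```python
import random


def scramble_sequence(seq: str, rng: random.Random) -> str:
    """Return a copy of `seq` with its characters in random order.

    The composition (how many A, C, G, T) is unchanged; only order is lost.
    """
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def scrambled_controls(seqs, seed=0):
    """Build one shuffled control per input sequence (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the controls are reproducible
    return [scramble_sequence(s, rng) for s in seqs]


# Made-up example sequences, for illustration only.
sequences = ["ATGCGTAC", "GGGTTTAA", "ACACACGT"]
controls = scrambled_controls(sequences)

for original, control in zip(sequences, controls):
    print(original, "->", control)
```

The same labels are then paired with the scrambled sequences, and the model is retrained and scored in the usual way for comparison.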

“Half of the battle in biological research is knowing how to ask the right questions,” says Soenksen. The platform helps users do that as well as provides insights leading to new questions, hypotheses, models, and experiments. 

Users can also opt for data saturation tests where BioAutoMATED sequentially reduces the dataset size to see the effect on model performance, Valeri says. “If you can say the models do great with 20,000 sequences, maybe you don’t have to go to the effort of collecting 50,000 or 100,000 sequences, which is a real impactful finding for a biologist actually doing the experiments.” 
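A data saturation test of this kind can be sketched as follows. The example is a toy under stated assumptions, not the platform's implementation: the data are synthetic, the "model" is a one-variable least-squares fit on GC fraction standing in for whatever model the autoML search selects, and all function names are hypothetical. The point is the loop, which retrains on progressively smaller random subsets and scores each on the same held-out test set.

```python
import random


def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def mean_squared_error(model, xs, ys):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def saturation_curve(train, test, fractions, seed=0):
    """Train on shrinking random subsets; score each on the same test set."""
    rng = random.Random(seed)
    test_x = [gc_fraction(s) for s, _ in test]
    test_y = [y for _, y in test]
    curve = []
    for frac in fractions:
        k = max(2, round(len(train) * frac))  # need at least 2 points to fit
        subset = rng.sample(train, k)
        model = fit_line([gc_fraction(s) for s, _ in subset],
                         [y for _, y in subset])
        curve.append((k, mean_squared_error(model, test_x, test_y)))
    return curve


# Synthetic data: the label depends linearly on GC fraction, plus noise.
data_rng = random.Random(1)


def make_example():
    seq = "".join(data_rng.choice("ACGT") for _ in range(30))
    return seq, 2.0 * gc_fraction(seq) + data_rng.gauss(0, 0.1)


data = [make_example() for _ in range(600)]
train, test = data[:500], data[500:]
curve = saturation_curve(train, test, [0.05, 0.25, 1.0])

for size, mse in curve:
    print(f"{size:4d} training sequences -> test MSE {mse:.4f}")
```

If the error stops improving well before the full dataset is used, collecting more sequences is unlikely to help much, which is the practical conclusion Valeri describes.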

Two of the most exciting outputs from the tool, in Valeri’s mind, are the interpretation and design results. Interpretation results indicate what a model is learning (e.g., nucleotides of elevated importance), including “sequence logos,” in which the larger a letter appears in the sequence, the more important it is to the function of interest. Sequence logos of the raw data can also be generated to facilitate comparisons across ML tools.
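For logos of raw data, one conventional way to size the letters is by per-position information content: each letter's height is its frequency at that position multiplied by the information (in bits) the position carries. The sketch below shows that standard calculation for DNA, without the small-sample correction. It is not necessarily how BioAutoMATED scales letters in its model-interpretation logos, and the function name and aligned sequences are made up for illustration.

```python
import math
from collections import Counter


def logo_heights(seqs, alphabet="ACGT"):
    """Per-position letter heights (in bits) for equal-length sequences.

    height(base, i) = frequency(base, i) * (log2(len(alphabet)) - entropy(i))
    """
    length = len(seqs[0])
    heights = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        freqs = {b: counts.get(b, 0) / total for b in alphabet}
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = math.log2(len(alphabet)) - entropy  # 2 bits max for DNA
        heights.append({b: f * info for b, f in freqs.items()})
    return heights


# Made-up aligned sequences, for illustration only.
aligned = ["AGGAGG", "AGGAGA", "AGGTGG", "AGGAGC"]
heights = logo_heights(aligned)

for position, column in enumerate(heights):
    tallest = max(column, key=column.get)
    print(position, tallest, round(column[tallest], 3))
```

A fully conserved position gets a single letter at the maximum height of 2 bits, while a position with mixed bases gets shorter letters, so conserved motifs stand out visually.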

Biologists using BioAutoMATED in this way can expect some actionable outputs, says Valeri. They might want to pay more attention to a motif that pops up through all these sequence logos, for example, or do a deep mutational scanning of a targeted region of the sequence that appears to be most important.  

The other key output is a list of de novo design sequences that are optimized for whatever function the model has been trained on, she says. For the newly published study, this focused on the downstream efficiency of a ribosome binding site to translate RNA into protein in E. coli bacteria. 

BioAutoMATED was also used to identify areas of the sequence most important in determining translation efficiency, and to design new sequences that could be tested experimentally. Further, the platform generated highly accurate information about amino acids in a peptide sequence most critical in determining an antibody’s ability to bind to the drug ranibizumab (Lucentis), as well as classified different types of glycans into immunogenic and non-immunogenic groups based on their sequences. 

Finally, the team had the platform optimize the sequences of RNA-based toehold switches. This informed the design of new toehold switches for experimental testing with minimal input coding required.  

Early Adopters 

The time it takes to obtain results from BioAutoMATED depends on several factors, including the question being asked and the size of the dataset for model training, says Valeri. “We’ve found the length of the sequence is a really big factor... and the compute resources you have available.” 

“The maximum user-allowed time for obtaining results is another important consideration,” adds Soenksen. The platform can search for hours or days, as circumstances dictate. Time constraints are routinely employed when training ML models as a matter of practicality.

Soenksen and Valeri both use BioAutoMATED as a benchmark for their own custom-built models, and friends who have tested the platform on different machines are enthusiastic about its potential, they say. In the manuscript, the platform also performed well on many different datasets, including ones specific to sequence lengths and types.

“I have personally used it for some quick paper explorations, trying to see what data are available... [without] having to take the time to code up my own machine learning models,” says Valeri. Although it is too soon to know how the tool will be used by biologists elsewhere, it is already being used regularly by a handful of scientists at Harvard investigating short DNA, RNA, peptide, and glycan sequences.  

BioAutoMATED is available to download from GitHub. “If we get a lot of traction [with it], and I think we will, our team will probably put more resources into the user interface,” notes Soenksen, a serial entrepreneur in the science and technology space. The long-term goal is to make the tool usable by clicking buttons to further lower barriers to access. 

“If you’re a machine learning expert, you’ll probably be able to beat the output of BioAutoMATED,” adds Valeri. “We are just trying to make it easy for people with limited machine learning expertise to [quickly] get to a pretty good model.”  

Complicated neural networks and big language models, which have a lot of parameters and require large amounts of data, are “not always best,” she says. The simple-model techniques identified by TPOT can be well suited to the often-limited datasets biologists have available and can perform as well as, if not better than, systems with more advanced ML architecture.