Bio-IT World Virtual: Gamifying Discovery, Human-in-the-Loop AI, New Data Commons Model

By Allison Proffitt and Deborah Borfitz

October 7, 2020 | Day two of the Bio-IT World Conference & Expo Virtual was highlighted by a plenary keynote on artificial intelligence (AI) and citizen science by Northeastern University’s Seth Cooper, the Human Computation Institute’s Pietro Michelucci, Cohen Veterans Bioscience’s Lee Lancashire, and McGill University’s Jerome Waldispuehl that made clear why machines alone can’t take on precision medicine. The messy abundance of patient data holds too many spurious patterns for machines to pick out the causal and actionable correlations on their own. Much more can be learned by having AI learn from humans in the loop, as has been demonstrated by online games such as Stall Catchers (where players search a mouse brain for stalled blood vessels) and Foldit (where they earn points by folding long chains of amino acids into compact, three-dimensional shapes) that are turning the time people spend on video games into real and impressive contributions to science.

The cloud-based BRAIN Commons being built by the nonprofit Cohen Veterans Bioscience is harmonizing clinical, gene expression and phenotypic datasets, and overcoming some of the analysis challenges, to make multi-dimensional disease modeling possible. A logical role for citizen scientists would be the “soft labeling” of data in real-world contexts, Lancashire says. At McGill University, genomics research was gamified a decade ago with the launch of Phylo DNA Puzzle, where players match colors and minimize gaps between tiles to improve the alignment of sequences. The game has had 350,000 participants and generated one million solutions. A newer version called Borderlands Science, released earlier this year with its own currency system, already has one million users discovering 50 million new ways to do things, Waldispuehl reports.
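The puzzle mechanic described above, matching colors while minimizing gaps, resembles a sum-of-pairs alignment score. The following is a minimal sketch of that idea, not the scoring used by Phylo or Borderlands Science; the function name and the match, mismatch and gap weights are assumptions chosen for illustration.

```python
from itertools import combinations

GAP = "-"


def score_alignment(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score for equal-length aligned rows.

    The weights are hypothetical; the actual games use their own scoring.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == GAP and y == GAP:
                continue  # a column that is a gap in both rows is not scored
            if x == GAP or y == GAP:
                total += gap  # a tile facing a gap is penalized
            elif x == y:
                total += match  # matching tiles ("colors") are rewarded
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    # Two candidate placements of the same short sequence against a reference.
    print(score_alignment(["ACGT", "AC-T"]))
    print(score_alignment(["ACGT", "A-CT"]))
```

A player moving tiles is, in effect, searching for the arrangement that maximizes a score of this kind.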

According to a presentation by Virginia Commonwealth University’s James Ferri, a central portal for molecular AI activities known as MPrint-Open Knowledge Network is expected to significantly contribute to understanding the performance and properties of chemicals and materials involved in 96% of all manufactured goods. Most of the available data is of questionable quality and not in a machine-readable format, but the collaborative system uses observational data to automate and democratize the extraction process. When applied to “Wild West” marketplace data on toothpastes, MPrint-OKN found some very similar formulations (e.g., P&G and GSK products) and also that brand recognition is a greater driver of consumer preference than chemical features. “Information architecture, IA, is as important if not more important than AI,” Ferri says.

Washington University’s Richard Head introduced attendees to CompBio, an augmented intelligence system for comprehensive interpretation of biological data that comprises two novel components: a hyper-dimensional memory model and an assertion engine for generating hypotheses that doesn’t need to be trained for the job. It is the first of two commercially available platforms, now in use by 150 labs at Washington University, which can identify potential regulatory mechanisms behind unexpected drug effects. A second platform (BioExplorer) has found unexpected associations between genes and cancer cells, and two other tools are under development related to adverse event reporting and analysis of fraud, waste and abuse.

The ability of machine learning to identify the causal drivers of clinical trial enrollment success was the theme of a presentation by Sylvia Marecki of EMD Serono and Omesan Nair at Merck KGaA. They employed an Operational Design and Study Accelerator (ODeSA) platform to learn which of 1,000 intrinsic (e.g., study design) and extrinsic (e.g., competing trials) factors were most impactful across 4,000 studies in the EMD Serono portfolio. Potentially actionable variables identified were geography, number of sites and investigator history of participation and performance. The data framework had good coverage (40% to 77% variance explained) and predictive performance. Based on a comparison of predicted enrollment with actual enrollment in oncology studies, ODeSA will be useful when modeling interventions.

The particulars of VAXI-DL, a web-based vaccine discovery system, were covered in a fast-paced session by Kamal Rawal of Amity University. The tool uses deep learning to classify proteins by their biological and physicochemical features and predict the best-performing vaccine candidates. Out of a starter set of 9,174 features, 1,447 were identified as the most important set, with improved results using the clustering ensemble approach. Relative to existing tools VaxiJen and Vaxign-ML, VAXI-DL had better sensitivity, specificity, precision and recall, Rawal reports. Use of the system requires entering protein sequences in FASTA format and an email address. The system has been successfully deployed for vaccine candidates for infectious diseases, including Chagas disease and COVID-19, as well as cancer.
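For readers unfamiliar with the FASTA input mentioned above: each record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows how such input could be read into memory. It is an illustration only, not part of VAXI-DL; the function name and the sample sequences are made up.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Hypothetical two-record input; the sequences are placeholders.
    sample = ">protein_1 example\nMKTAYIAK\nQRQISFVK\n>protein_2\nGSHMLE\n"
    for name, seq in parse_fasta(sample).items():
        print(name, len(seq))
```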

Roche’s Etzard Stolte led a session about an agile knowledge management integration and discovery platform with 22-plus capabilities and 10,000 users that epitomizes the shift to data fabrics and common semantics in response to challenges with traditional methods (index, glossary, catalog, data lake or warehouse). The focus is on creating all relevant data digitally, exploiting tacit knowledge, consistent architecture, adherence to FAIR (findability, accessibility, interoperability, and reusability) data principles, and building on previous knowledge. Any added AI capabilities will have to meet the 75% accuracy threshold so scientists can trust the results.

Semantic integration needs to become part of an integrated learning framework for more informed scientific decision-making, based on the strength of evidence, according to GlaxoSmithKline’s Samiul Hasan. GSK’s pilot efforts have identified “novelty factors” (new, trendy, established or outlier) associated with data and produced a Knowledge Lab that has linked cardiac troponin expression changes with bromodomain gene 4. Impacts have included surfacing evidence from a rare disease trial that the project team had missed, a mechanistic hypothesis not considered by the program team, and a plausible mechanism for a lab observation.

Ensuring previously collected biomedical data doesn’t get lost could go a long way toward ensuring drugs get properly dosed and used on the right group of patients as well as help with outcome-based reimbursement, says Roche’s Ewa Jermakowicz, who noted the “huge role” for data-sharing standards. When research data is not FAIR, the total cost of lost time alone has been pegged at 4.5 billion euros ($5.1 billion).

Matthew Trunnell presented the Chicagoland COVID-19 Data Commons, the first regional data commons from the Pandemic Response Commons focusing on COVID-19 data in Chicago and surrounding areas, and outlined a five-level Data Commons Maturity Model, representing the growing capabilities of mature data commons. The basic commons, Level One, offers shared access to datasets. A Level Two commons ensures that the computational environment can be recreated, and that both data and tools can be accessioned. A Level Three commons brings in governance, defined processes for all aspects of operations and formal models for engaging stakeholders. Interoperability is a hallmark of a Level Four commons, ensuring that commons can talk to each other. Finally, a Level Five commons is the “grand goal,” Trunnell said: a commons that is sustainable with investment from all of the data providers and consumers. The Chicagoland COVID-19 Data Commons was launched about a month ago, Trunnell said. Researchers can log in with Google, ORCID or InCommon credentials and can pull local health, demographic, and resident-reported symptom data into Jupyter notebooks. The consortium is driven by working groups and the governance is based on a series of legal agreements for partners and users.
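The five maturity levels described above can be written down as a small ordered data structure. This is only a sketch for readers who think in code: the identifier names are invented here, and the assumption that each level includes the capabilities of the levels below it is an interpretation of "growing capabilities," not something Trunnell is quoted as saying.

```python
from enum import IntEnum


class CommonsLevel(IntEnum):
    """Hypothetical names for the five levels of the maturity model."""
    SHARED_ACCESS = 1
    REPRODUCIBLE = 2
    GOVERNED = 3
    INTEROPERABLE = 4
    SUSTAINABLE = 5


# Short summaries paraphrased from the talk as reported above.
CAPABILITY = {
    CommonsLevel.SHARED_ACCESS: "shared access to datasets",
    CommonsLevel.REPRODUCIBLE: "recreatable compute environment; data and tools accessioned",
    CommonsLevel.GOVERNED: "governance, defined processes, stakeholder engagement models",
    CommonsLevel.INTEROPERABLE: "interoperability with other commons",
    CommonsLevel.SUSTAINABLE: "sustained by investment from data providers and consumers",
}


def capabilities_up_to(level):
    """List capabilities for a level, assuming levels are cumulative."""
    return [CAPABILITY[lvl] for lvl in CommonsLevel if lvl <= level]


if __name__ == "__main__":
    for item in capabilities_up_to(CommonsLevel.GOVERNED):
        print("-", item)
```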

The models aren’t right, John Quackenbush (Harvard Medical School) reminded us, but they do provide insight into disease. The blind application of machine learning approaches has not delivered, he argues, because of the limited problem-specific knowledge that goes into designing the algorithms. There is no free lunch, he says, citing a 1997 Wolpert and Macready paper. “We’ve actually got to build into the algorithms that we’re going to use information about the specific problems we’re trying to solve,” he said. Quackenbush and his colleagues are exploring networks to set biological constraints to better understand and solve problems. Biological systems are driven by complex networks and the structure of those networks informs our understanding of the biology, even though the network in each tissue, biological state, and individual is different. Quackenbush’s group has developed the Methodological Network Zoo, a freely available collection of network algorithms, each named for an animal (and one MONSTER). Each algorithm is designed for a specific type of input data and builds biologically informed networks.

Tom Plasterer of AstraZeneca outlined FAIR’s guiding principle: better data stewardship, not particular ways of implementing it. AstraZeneca wanted to FAIRify re-use cases, but only those that are “steel threads” that tie multiple communities together in service of a business question, he said. The process of FAIRifying re-use starts with capturing user stories, then moves through reviewing available data, building (or extending existing) models and ontologies, transforming data into the model, enriching the master and running quality control on the FAIRified datasets. It’s useful to examine your data with a faceted browser (AstraZeneca uses DisQover from Ontoforce), Plasterer said, and settle on a publishing strategy: what will you do with the data, who will have access to it, etc. “Starting with a use-case driven approach, running through the ‘steel thread’ and really making it go from the use case to things that are useful to the business and all the users, we found that to be a really powerful way of doing this, because then you’re never building things that are too far away from the scientific question and the science you need to satisfy,” Plasterer said.

Editor’s Note: Even if you missed the start of the event, Bio-IT World Conference & Expo Virtual is still live. Register now for on-demand presentations from days one and two and live access to the final day of programming.