AI Betting Agents Weigh In On Reproducibility Of Research Findings

April 5, 2022

By Deborah Borfitz

April 5, 2022 | Penn State researchers have created synthetic “prediction markets” for speculating on the replicability of published studies with unparalleled accuracy. When betting on published social and behavioral science papers, about 90% of the time bot agents correctly “bought” papers that were eventually reproduced, according to Sarah Rajtmajer, Ph.D., assistant professor in information sciences and technology. In cases where they didn’t have enough information, bot agents knew not to make bets.

The researchers, mostly computer scientists, were part of a broader, three-year project evaluating confidence in science; bot-based markets were their approach to the problem. DARPA, the Defense Advanced Research Projects Agency, launched the Systematizing Confidence in Open Research and Evidence (SCORE) program with the intent of helping decision-makers in the defense community know what research they can trust and why, but Rajtmajer says she believes the artificial intelligence (AI) tactics employed by SCORE's "TA3" team at Penn State have applications in public-facing science.

For scientists like Rajtmajer, AI bidding agents could help put individual studies in the context of a larger body of literature, so researchers know whether the findings would hold if a study were repeated, possibly in a different population or using a different analysis method, she says. Getting at what matters, in terms of replication, will take a while and a whole lot more training data, but over the longer term it could inform what good science looks like.

The DARPA program, which concludes at the end of May, involves eight primary research teams organized into three technical areas (TAs). The TA1 team, at the Center for Open Science (COS), was tasked with replicating hundreds of published claims and annotating several thousand research claims, Rajtmajer says.  

That massive dataset was then handed off to TA2 teams from KeyW/Jacobs Corporation and the University of Melbourne, which asked human experts to provide "confidence scores" predicting the reproducibility or replicability of the claims, she continues. In this role, the University of Melbourne spearheaded the Collaborative Assessment for Trustworthy Science (repliCATS) project, crowdsourcing evaluations of the credibility of published research in eight social science fields: business research, criminology, economics, education, political science, psychology, public administration, and sociology.

Data from repliCATS was used by the COS to identify the best human experts, based on their performance against the initial replication exercise, and their scores were given to TA3 teams at the University of Southern California, TwoSix Labs (Arlington, Virginia) and Penn State, says Rajtmajer. The three groups were asked to develop machine learning algorithms to assign confidence scores to published claims just like the human evaluators.

Only Penn State opted for the "out-there approach" of mimicking the crowdsourced prediction markets already being used to help anticipate everything from elections to sports scores, she notes. The test prototype was demonstrated at the Association for the Advancement of Artificial Intelligence (AAAI) conference in February, and the feat required not just Rajtmajer but also a team of computer scientists including principal investigator (PI) C. Lee Giles, David Reese Professor at the College of Information Sciences and Technology at Penn State and creator of the CiteSeer academic search engine.

“We have already seen that humans are pretty good at using prediction markets to estimate reproducibility,” Giles says. “But, here, we’re using bots in our market, which is a little unusual but really promising.”

In addition to Rajtmajer and Giles, PIs on the bot project include Jian Wu, assistant professor in computer science at Old Dominion University; Christopher Griffin, Applied Research Laboratory at Penn State; James Caverlee, professor of computer science at Texas A&M; Anna Squicciarini, Frymoyer Chair in Information Sciences and Technology at Penn State; Anthony Kwasnica, professor of business economics at Penn State; and David Pennock, director of the Center for Discrete Mathematics and Theoretical Computer Science at Rutgers.

Mid-program, when the pandemic hit, DARPA had SCORE teams pivot to evaluate all the emerging COVID-19 research. Despite the challenge of finding relevant papers for algorithm training purposes—much of it from previous SARS- and MERS-related epidemics—TA3 teams did “respectably,” says Rajtmajer. Throughout the program, performance evaluation of all SCORE teams was managed by a “Test and Evaluation” (T&E) team of researchers from RAND and MITRE.

Focus On Features

What sets the Penn State team's approach apart is that each participating bot cares more or less about different features extracted from the papers in question and associated metadata (e.g., sample size, p-values, authors' institutions, funding sources)—a total of 41 paper-level features in the prototype demonstrated at AAAI. Based on these features and the training data to which it has been exposed, each bot has an opportunity to buy contracts associated with replication outcomes (i.e., whether a claim will or will not replicate), she explains.
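As a rough illustration of that mechanism, the toy Python sketch below shows bot agents that each attend to a subset of paper-level features, buy a "replicates" or "fails" contract only when a paper resembles training examples they have seen, and otherwise abstain. The class names, the distance-based decision rule, and parameters such as `radius` and `budget` are hypothetical simplifications for illustration, not the Penn State team's actual implementation.

```python
# Toy synthetic prediction market (hypothetical simplification, for illustration only).
import numpy as np

class BotAgent:
    """A bot that attends to a subset of paper-level features and bets on a
    replication contract only when the paper resembles training examples it
    has seen; otherwise it abstains."""

    def __init__(self, feature_idx, train_X, train_y, radius=1.0, budget=10.0):
        self.feature_idx = list(feature_idx)               # features this bot cares about
        self.train_X = np.asarray(train_X)[:, self.feature_idx]
        self.train_y = np.asarray(train_y)                 # 1 = claim replicated, 0 = it did not
        self.radius = radius                               # how far the bot will look for evidence
        self.budget = budget

    def bid(self, paper_features):
        x = np.asarray(paper_features)[self.feature_idx]
        dists = np.linalg.norm(self.train_X - x, axis=1)
        nearby = dists < self.radius
        if not nearby.any():                               # participation is voluntary:
            return None                                    # nothing close enough, so no bet
        belief = self.train_y[nearby].mean()               # share of similar papers that replicated
        side = "replicates" if belief >= 0.5 else "fails"
        stake = min(self.budget, abs(belief - 0.5) * 2 * self.budget)
        return side, stake

def run_market(bots, paper_features):
    """Aggregate the bots' purchases into a confidence score in [0, 1];
    return None if no bot chooses to participate."""
    bought_yes, total = 0.0, 0.0
    for bot in bots:
        bet = bot.bid(paper_features)
        if bet is None:
            continue
        side, stake = bet
        total += stake
        if side == "replicates":
            bought_yes += stake
    return None if total == 0 else bought_yes / total
```

A real prediction market would also clear prices and let agents trade against one another; this sketch captures only the buy-or-abstain behavior the researchers describe.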

“At this point, we can’t say that one particular feature across the board is most important for replication,” Rajtmajer says. “Our bots only know what they’ve seen and, as it stands, we are dealing with imperfect proxy training data, the total size of which is only about 3,000 claims.”

Development of novel methods for claim-level feature extraction is now underway, which will allow the bots to differentially score multiple claims within the same paper, adds Wu, lead researcher for the feature extraction effort. "We are now extracting more nuanced information from text, such as theories and models motivating claims, but it's still not clear that we are capturing the same things that have meaning to human expert reviewers when they read a paper."

For DARPA, the end goal of SCORE is to build automated tools that can rapidly and reliably assign confidence scores to social science research claims. “At this point, all our algorithms have been trained on are papers in the social and behavioral sciences… where the attention around the replication crisis first emerged,” Rajtmajer says. “The defense community has no idea… how to sort through the noise.”

The Penn State team has already put in a proposal to the National Science Foundation for a grant to build out its bot-based approach for computer science, where the reproducibility problem is equally serious but less publicized, Rajtmajer says.

‘Reviewer Four’

The vision now being cast by some SCORE teams is to stand up an assortment of publicly accessible tools, says Rajtmajer. But for the immediate future, the focus will rightfully be on improving confidence in research claims. “Being able to understand what science is credible and does it generalize—and aggregating scientific outcomes in a way the public can trust—are more important now than ever.”

Confidence scores generated by the algorithms should function as “reviewer four” in the peer review process, Rajtmajer says. “We are not thinking about replacing a human expert in terms of evaluating a paper but providing an additional view. But we know that human reviewers are necessarily going to have a narrow view of the literature just because they haven’t read millions of papers… and they have inherent biases.”

It is well acknowledged that peer review is a flawed process, continues Rajtmajer, “but we have not had a good alternative.” Although she doesn’t yet have faith in the confidence scores her algorithm puts out, she adds, “I think the explanations could definitely be useful”—e.g., a paper may be given a low chance of reproducing simply because the model had never seen such a large sample size before. 

The system Rajtmajer intends to build out for computer science and AI research papers will not necessarily even feature a numeric score because it could provide a misleading sense of certainty and could be “weaponized,” she says. Instead, the algorithm will indicate high, medium, or low confidence in reproducibility in various dimensions (e.g., study methodology or sentiment of downstream citations) that are more qualitative.

Bigger Picture

Researchers at Penn State are about to invite human experts to participate in the bidding alongside the bots to see how that improves performance. “I think our most exciting opportunity is for human-AI collaboration in this space,” Rajtmajer says, noting that interest in public-facing prediction markets has been reinvigorated due to blockchain-based protocols.

The literature has at best 25 papers on synthetic, or artificial, prediction markets, which are inherently "just another machine learning algorithm," she notes. Attempts to pair human wisdom with AI have always been a before-or-after proposition because "you can't insert a human in the black box of the neural net. But in our market [published research], we hope to… integrate machine rationality with human intuition."

As developed through the SCORE program, humans will be in the mix for the purposes of evaluating scientific replicability. But in the future, bot-based markets could be used for a broader set of tasks, says Rajtmajer.

For a bidding agent to say “I don’t know” is a fair outcome, she notes. In results presented at the AAAI conference, the system provided confidence scores for only about 35% of 192 ground truth replication studies.

“Participation of our bots in the market is voluntary,” explains Rajtmajer. “If agents aren’t close enough to some information that they care about, they won’t buy anything.” It’s the same in human prediction markets where “people just won’t participate if they have no idea about the outcome.”

The team could have amended the algorithm to enlarge the agents' radius for considering information, but at the expense of precision. "We can force our agents to buy stuff, but at this point we have selected a threshold where we feel like we're getting super-high accuracy [90% F1 score] on the papers that we do score," she says.

For purposes of the DARPA project, the team’s scores for overall accuracy were lower because a score was required for every test point, Rajtmajer adds. When the market couldn’t produce a score for a claim, it was arbitrarily assigned a value of 0.5 on the 0-to-1 rating scale.
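As a hypothetical illustration of that evaluation convention (the function name and example scores below are invented), abstentions can simply be back-filled with a neutral 0.5 before an overall metric is computed, which is why overall accuracy fell below the accuracy on the claims the market actually scored:

```python
# Hypothetical illustration of the back-fill convention described above.
def fill_abstentions(market_scores, neutral=0.5):
    """Replace None (market declined to bet) with a neutral score on the 0-to-1 scale."""
    return [neutral if score is None else score for score in market_scores]

scores = [0.92, None, 0.08, None, 0.81]   # None = the market abstained on that claim
print(fill_abstentions(scores))           # [0.92, 0.5, 0.08, 0.5, 0.81]
```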