ChatGPT overlooked retractions and other concerns in research articles when prompted to evaluate their quality, highlighting the limitations of the AI tool.
Not long ago, researchers may have spent weeks combing through the literature, looking for clues to piece together a perfect research paper. Now, many simply seek help from large language models (LLMs) like ChatGPT to speed up their work.
With academics increasingly relying on ChatGPT, Mike Thelwall, a data scientist at the University of Sheffield, sought to understand its credibility. “We wondered whether ChatGPT knew about retractions and would report [them]…or whether it would filter them out of its database so that it wouldn’t report them,” said Thelwall.
To find out, Thelwall and his team asked the LLM to assess the quality of discredited or retracted articles and discovered that ChatGPT did not flag the concerns.1 Their results, published in the journal Learned Publishing, emphasize the need for verifying information obtained from LLMs.
“This is a fantastic paper [on a] really, really important topic,” said Jodi Schneider, an information scientist at the University of Illinois Urbana-Champaign, who was not involved in the study. The bottom line for researchers is “don’t trust any fact that is coming from AI [tools],” she noted.
For their study, Thelwall and his team identified 217 articles that either made controversial claims or had been retracted. They then submitted each article’s title and abstract to ChatGPT and asked the tool to rate the paper’s quality against standard guidelines 30 times, yielding 6,510 responses in total. They did not ask the LLM upfront whether the articles had been retracted, “because that’s not what a user would do,” said Thelwall.
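For readers curious what this kind of repeated prompting looks like in practice, the short Python sketch below sends one article’s title and abstract to a chat model many times and collects the replies. It is only an illustration of the general approach, not the authors’ pipeline: the model name, prompt wording, and scoring scale are assumptions, and the study’s exact setup may differ.

```python
# A minimal sketch, assuming the OpenAI Python client and a hypothetical prompt;
# not the authors' actual code or settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUALITY_PROMPT = (
    "Assess the quality of the following academic article from its title and abstract. "
    "Score it from 1 (weak) to 4 (world-leading) and briefly justify the score.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def score_article(title: str, abstract: str, repeats: int = 30) -> list[str]:
    """Ask the model to rate the same title and abstract `repeats` times."""
    replies = []
    for _ in range(repeats):
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; the study's model may differ
            messages=[
                {
                    "role": "user",
                    "content": QUALITY_PROMPT.format(title=title, abstract=abstract),
                }
            ],
        )
        replies.append(completion.choices[0].message.content)
    return replies

# 217 articles x 30 repeats gives the 6,510 responses described above.
```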
None of the 6,510 responses that ChatGPT generated mentioned that the articles were retracted or had been flagged for serious concerns. The tool scored a majority of the papers highly, rating the articles as “world-leading” or “internationally excellent.”
This surprised Thelwall. “I was really expecting a low score because of the retraction,” he said. “But it didn’t do that in nearly all cases.”
Although ChatGPT identified a few methodological weaknesses in the articles it scored lower, none of these criticisms were relevant to the articles’ retraction or correction statements. Only in five cases did the LLM mention that the study was part of a controversy.

To investigate further, the researchers directly asked ChatGPT whether specific claims from the retracted articles were true. Almost two-thirds of the time, the LLM responded that the claims were likely to be true, partially true, or consistent with research. While it sometimes noted that statements were not established or were unsupported by current research, it flagged a statement as false in only one percent of cases.
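As a rough illustration of this follow-up analysis, the sketch below buckets model replies to a “is this claim true?” prompt into the broad verdict categories described above. The keyword matching is a simplifying assumption for illustration; the actual study would have coded responses far more carefully.

```python
# A rough sketch of tallying verdicts on claims from retracted papers.
# The keyword matching below is an assumed stand-in, not the study's coding scheme.
from collections import Counter

def classify_verdict(reply: str) -> str:
    """Crudely bucket a model reply into the broad categories described above."""
    text = reply.lower()
    if "false" in text:
        return "false"
    if "partially true" in text or "consistent with" in text:
        return "partially true / consistent"
    if "not established" in text or "unsupported" in text:
        return "not established / unsupported"
    if "true" in text:
        return "true"
    return "other"

def tally(replies: list[str]) -> Counter:
    """Count how often each verdict category appears across all replies."""
    return Counter(classify_verdict(r) for r in replies)

print(tally([
    "This claim is likely to be true based on published work.",
    "The statement is partially true but overstated.",
    "This claim is not established in current research.",
]))
# Counter({'true': 1, 'partially true / consistent': 1, 'not established / unsupported': 1})
```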
The results did not surprise Schneider, for two reasons. “We know that LLMs lie. They tell us what we want to hear,” she said. Additionally, through her studies on post-retraction citations, she had observed that researchers continue to cite work after it has been retracted, indicating that retracted research still circulates.
She noted that one of the strengths of the study was how carefully the researchers chose the retracted articles. “They gave a lot of thought to the subtleties of why things were retracted, and whether [in] certain reasons for retraction, the information might still be valid…like plagiarism,” she said. Going forward, she suggested that researchers could use this methodology to investigate whether other LLMs show a similar tendency.
Thelwall agreed, noting that they could conduct a similar study using more powerful models. But for now, he hopes that the results highlight the limitations of AI tools. “They can augment us; they can help us, make us more efficient,” he said. “But they can’t replace [us]. Not yet, anyhow.”