Nihar Shah, an accomplished artificial intelligence (AI) researcher and associate professor at Carnegie Mellon University, delivered a seminar at the Center for Language and Speech Processing (CLSP) on October 10th titled “LLMs in Science, the good, the bad and the ugly.” The seminar surveyed the role of AI in scientific research and peer review.
Shah primarily focused on the evolving role, potential and consequences of large language models (LLMs) in scientific discovery as they become more widely adopted in research communities worldwide. He opened by addressing a persistent challenge in academic research: the effectiveness of peer review. In a study involving top-tier conferences such as NeurIPS, AAAI and ICML, Shah and his team examined reviewers’ ability to spot errors in submitted papers. His team planted intentional errors in three different papers: one contained a blatant error, one a less apparent error and the last an extremely subtle one.
Shah reported that, for the paper containing the blatant error, 54 of the 79 reviewers made no comment on the section containing the error, 19 believed the section was sound and only one flagged it, remarking, “this looks really fishy.” His findings revealed a glaring vulnerability in the current peer review system, one he largely attributed to the constant pressure and time constraints reviewers face.
In addition to these challenges, Shah commented on the value of ethical science practices. He noted how fraud has increasingly infiltrated the peer review process, citing collusion rings and the ability to manipulate paper assignments by selectively bidding on papers. Another concern he mentioned is that reviewers can self-select which papers to upload to the conference portal, causing the system to generate an inaccurate profile that proclaims them subject experts for those specific works. In more drastic cases, individuals have used fraudulent email accounts, often associated with an accredited institution, to pose as qualified reviewers.
To combat these issues, Shah mentioned protocols, such as submitting trace logs, to specifically target fraudulent reviewing. For instance, trace logs detail a timestamped list of when reviewers accessed each component of the manuscript, what tools they used and the comments they left behind. This measure is designed to prevent reviewers from skipping parts of a paper and fabricating analyses they claim to have done. Despite these safeguards, the peer review process remains highly susceptible to human error and fatigue. LLMs, however, can operate effortlessly and tirelessly.
Transitioning from human to machine evaluation, Shah compared the performance of advanced LLMs like OpenAI’s GPT-4 to that of human reviewers. LLMs were consistently able to catch the most obvious flaws “across many, many runs every single time.” However, the less apparent error was only identified after the prompt was reworded to guide or steer the LLM’s focus toward that particular section of the manuscript.
“When you specifically asked [GPT-4] to look at that [apparent error in the paper], [GPT-4] said, ‘Oh, yeah, here’s the problem,’” Shah commented. This demonstrates that LLMs can effectively identify issues when prompted but cannot yet replace human expertise.
Concluding the seminar on a forward-looking note, Shah explored the rise of AI scientists. These are systems that can generate hypotheses, design experiments and write research papers without any human intervention. “When you give it some broad direction, it can do all the research, including generating a paper,” Shah said.
Shah emphasized the immense potential for AI scientists to dramatically accelerate the pace of discovery by automating routine tasks that currently consume human researchers’ time. However, he cautioned that AI scientists raise many issues of their own, including generating artificial data sets, engaging in p-hacking and selectively reporting benchmarks. For example, his team discovered that AI scientists sometimes select and report only the best-performing results.
“A key takeaway is that LLMs present great opportunities [but] also challenges, and whenever we’re looking at this program, we’d like to think about them from whatever first-principle objectives of science.” As Shah reiterated how LLMs offer significant opportunities, he implored the audience to approach AI adoption with constant scrutiny, especially in the context of scientific integrity.