Prejudicial Peer Review with AI
ChatGPT can be misused to conduct scientific peer review with a predetermined outcome.
I recently reviewed a manuscript for a biotechnology journal I won’t name. The paper was clearly written by AI. I don’t mean edits or revisions here and there; I mean the entire manuscript. As if they had copied and pasted the entire contents of a certain research organization’s website into ChatGPT and asked for a paper on the topic. Other dead giveaways were awkward Sentence Case Capitalization Everywhere, em dashes everywhere—unfortunate because I like the em dash and have to remind myself to stop using them—and several “delves” and other turns of phrase I see more often with GPT-4o than with other more recent models. I read the paper, wrote a detailed review of the content itself, and sent comments to the editor about my suspicions.
But this got me thinking: what would peer review look like when outsourced to AI, and how could those tools be biased one way or another by the initial prompt?
I didn’t want to violate the journal’s policies by uploading the manuscript to ChatGPT (even though the paper was unambiguously the unedited output of ChatGPT). So I uploaded one of my own papers that was recently peer reviewed and published in PLOS ONE, and asked for three separate prejudicial peer reviews, each with a different predetermined outcome: accept, reject, or revise.
Demo: “peer-reviewing” my own published paper with ChatGPT
This was my prompt to ChatGPT (4o):
I am going to give you a manuscript submitted to PLoS ONE. You are an expert data scientist with an interest in infectious disease modeling and forecasting, and you have been asked by PLoS ONE to serve as a peer reviewer for this submission. Please write three different reviews of this paper.
One review should be scathing and critical and recommend rejection.
The second should be overly positive, recommending acceptance without revision.
Finally, write one that’s fair and well-balanced.
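I ran this through the ChatGPT web interface, but nothing stops a reviewer from scripting it. Here’s a minimal sketch of what that might look like, assuming the OpenAI Python SDK and the gpt-4o model; the prejudicial_review function and its verdict argument are hypothetical names of my own, there only to show how a predetermined outcome reduces to a function parameter.

```python
# Minimal sketch (not my actual workflow): generating a peer review that
# supports a verdict chosen before the manuscript is even read.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def prejudicial_review(manuscript_text: str, verdict: str) -> str:
    """Return a review written to justify a predetermined verdict (hypothetical helper)."""
    prompt = (
        "I am going to give you a manuscript submitted to PLoS ONE. "
        "You are an expert data scientist with an interest in infectious disease "
        "modeling and forecasting, and you have been asked by PLoS ONE to serve "
        "as a peer reviewer for this submission. "
        f"Write a review that recommends the manuscript be {verdict}.\n\n"
        + manuscript_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g., prejudicial_review(paper, "rejected")
#       prejudicial_review(paper, "accepted without revision")
```

Swap the verdict string and the same manuscript gets a confident-sounding review pointing in the opposite direction.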
The three resulting reviews are copied below.
What’s alarming about this is that I could come in with a predetermined opinion about the paper based on the title, abstract, or byline. If I wanted, I could ask for a review that supports whatever prejudicial decision I have already made even before reading the paper.
I would hope that good scientific ethics and journals’ policies on AI use in peer review would prevent this from being widely used, but I have no doubt that stealth use/misuse of AI for peer review is not uncommon.
Reviewer #1: accept as-is 🟢
Recommendation: Accept as-is
Summary: This is an outstanding manuscript that makes a timely and important contribution to the field of epidemiological modeling and data quality assessment. The authors introduce PLANES, a novel, interpretable, and highly practical framework for evaluating the plausibility of infectious disease forecasts and surveillance signals. It is supported by a well-constructed and open-source R package, rplanes, that is ready for immediate use by the public health community.
Major Strengths:
Novelty and Relevance: The manuscript addresses a major gap in public health forecasting—how to quickly and systematically assess the plausibility of forecast signals before they are disseminated or incorporated into ensemble models.
Robust and Thoughtful Methodology: The multi-component structure of PLANES (e.g., repeat, taper, shape, trend) is methodologically sound and grounded in real-world operational needs. Each component captures a different dimension of data plausibility and together form a comprehensive scoring system.
Clear Demonstration with Real Data: The use of 2022–23 FluSight data is compelling. The correlation between PLANES flags and forecast performance (WIS) demonstrates the utility of the tool in practice. The linear increase in WIS with higher PLANES scores is particularly telling.
Impactful Software Contribution: The rplanes R package is a model of open-source best practices: well-documented, intuitive, extensible, and backed by a Shiny UI for accessibility. This will democratize plausibility analysis across diverse public health agencies, including in resource-limited settings.
Operational Maturity: The authors’ use of PLANES in live forecasting workflows during the 2023–24 FluSight season illustrates its operational readiness and real-world value.
Minor Suggestions (Optional):
Future work could explore automated weighting schemes or machine learning-enhanced component selection.
Expanding use cases beyond respiratory disease would further showcase generalizability.
Conclusion: PLANES is a substantial and much-needed contribution to the public health modeling ecosystem. The methodology is sound, the software is accessible, and the impact is immediate. I recommend this manuscript for publication without revision.
Reviewer #2: reject 🔴
Recommendation: Reject
Summary: While the authors attempt to introduce a framework for evaluating epidemiological forecast plausibility, the manuscript falls short of scientific rigor and novelty. The proposed PLANES methodology is conceptually simplistic, its utility questionable, and its validation superficial. The work reads more like a software vignette than a substantive research article.
Major Concerns:
Lack of Theoretical Innovation: The concept of plausibility flags derived from heuristics (e.g., repeated values, tapering prediction intervals, sudden zeroes) is not novel. These checks are essentially sanity filters that most modeling groups already apply informally. Codifying them into a scoring system does not constitute a methodological advancement.
Weak Validation: The authors correlate their heuristic score (PLANES) with a known performance metric (WIS), but the observed correlation is modest (r = 0.267). This is insufficient to support the claim that PLANES can meaningfully guide model weighting or forecast censoring in practice.
Overly U.S.-Centric and Narrow Scope: The manuscript repeatedly refers to FluSight and U.S.-based systems without evaluating generalizability. No validation is presented for use with non-influenza data, non-U.S. systems, or in low- and middle-income countries, despite vague claims that PLANES is “globally applicable.”
Overreliance on Software as Contribution: The rplanes R package is emphasized throughout, but the paper blurs the line between research and software documentation. Publishing a user manual for a package—without robust experimental design or statistical analysis—is not appropriate for a peer-reviewed journal.
Insufficient Comparison to Existing Tools: The manuscript fails to benchmark PLANES against existing time series anomaly detection methods (e.g., Prophet’s changepoint detection, STL decomposition, etc.) or modern ML-based surveillance tools. The cited references on anomaly detection are largely ignored in the evaluation.
Minor Issues:
The acronym PLANES is strained and adds little clarity.
Figures are underwhelming and visually uninformative.
Excessive self-citation of prior unpublished or preprint work.
Conclusion: This work lacks methodological depth, rigorous validation, and general applicability. It is not appropriate for publication in its current form. Substantial improvements would be required to elevate this from software documentation to scientific contribution.
Reviewer #3: minor revisions 🟡
Recommendation: Accept pending minor revisions.
Summary: This manuscript introduces PLANES, a framework for evaluating the plausibility of infectious disease forecasts and surveillance signals, implemented in the rplanes R package. The approach is conceptually grounded in a modular scoring system that applies multiple heuristic-based components to detect anomalies. The authors demonstrate its utility using real-world FluSight data and show some correlation with forecast performance. While the work is timely, operationally useful, and well-documented, certain methodological and validation aspects could benefit from clarification or expansion.
Major Strengths:
Practical Relevance and Clear Motivation: The need for systematic plausibility assessments in real-time public health forecasting is real and underserved. PLANES addresses a practical gap by offering a modular approach tailored for operational use, especially in multi-model ensemble settings like FluSight.
Open-Source Software with Real-World Utility: The rplanes package is a clear asset to the community. It includes a clean API, well-integrated Shiny interface, and flexible scoring framework. The fact that the authors are already using it operationally during forecast seasons enhances credibility.
Interpretability and Modularity: The individual components (e.g., shape, taper, zero, repeat) are intuitive and easily communicated to non-technical public health users, which aligns with the goal of transparency in decision-making.
Areas for Improvement:
Limited Methodological Novelty: While the integration of multiple plausibility checks is useful, most components (e.g., large jumps, sudden zeros, repeats) are standard heuristics. There’s little in the way of new statistical or algorithmic innovation. A deeper discussion of how PLANES complements or improves upon existing anomaly detection or forecast evaluation methods would be valuable.
Validation is Encouraging but Modest: The reported correlation between PLANES scores and WIS performance (r ≈ 0.27) suggests a relationship, but it's relatively weak. More robust validation—such as comparing PLANES with traditional anomaly detection tools, or testing predictive utility across seasons or diseases—would strengthen the case for its broad adoption.
Overgeneralization of Applicability: The claim that PLANES is applicable globally and across diseases may be premature. While the tool is technically generalizable, the heuristics were designed for flu hospitalizations in U.S. states. Explicit demonstrations on different types of signals (e.g., vector-borne, syndromic) would make this claim more convincing.
Role in Decision-Making Could Be Better Framed: The manuscript implies PLANES can guide censoring or down-weighting of forecasts, but the risks of misuse (e.g., removing novel but accurate signals) are underexplored. A clearer discussion of how human oversight should be incorporated would improve the framing.
Minor Comments:
Figures 2 and 3 are helpful but could be strengthened with clearer axis labels and brief embedded descriptions.
The PLANES acronym is somewhat forced and not referenced frequently enough to justify its prominence.
Some redundancy in the methods and discussion sections could be tightened.
Conclusion: This manuscript presents a thoughtful and well-executed approach to plausibility scoring of epidemiological signals, with clear operational utility. While it doesn’t introduce major methodological innovation, its strengths lie in interpretability, usability, and real-world deployment. With some additional validation and refinement in claims, this work will serve as a valuable contribution to public health forecasting practice.
Final thoughts
Peer review has never been perfect. Subjectivity, bias, and inconsistency will always be part of the process. But AI tools like ChatGPT introduce a new kind of malleability that makes it easy to dress up a predetermined conclusion in the language of objectivity. As this little experiment shows, with the right prompt, AI can generate a review that sounds convincing whether you’re advocating for immediate acceptance or total rejection, regardless of the manuscript’s content.
I don’t think I’d go so far as to advocate for a ban on AI in peer review. In fact, when used transparently and ethically, and with a very good prompt, I imagine these tools could help streamline and even improve parts of the process. I’ve submitted hundreds of manuscripts for peer review over my career and I’ve gotten so many truly terrible reviews: reviewers who clearly didn’t read the paper, reviewers phoning it in with a conclusion but little reasoning or suggestion, reviewers without any real understanding of statistics or computer science trying to review a methods paper, and, worst of all IMHO, reviewers who were reviewing the manuscript they wanted to read, not the manuscript I submitted.
I think there may be a place for AI in peer review, one that could bring some objectivity and consistency to an otherwise subjective and stochastic process. And honestly, I think it’s inevitable. I have no doubt that reviewers are using AI even when it’s technically not allowed. Journals and editors should recognize the potential for misuse and set clear boundaries rather than impose blanket bans that can’t be enforced. At the very least, we need transparency: if a review was AI-assisted, everyone involved should know it.