
Google's Paper Assistant Tool Reviewed 4,700 Papers at ICML and STOC

TL;DR

  • PAT achieved 89.7% accuracy on the SPOT benchmark for math error detection, up from 55.2% for zero-shot Gemini 3.1 Pro.
  • Over 4,700 manuscripts were reviewed at STOC and ICML; 97% of STOC and 92.1% of ICML respondents said they would use PAT again.
  • Combined AI conference submissions are projected to reach 73,883 in 2026, up from 32,628 in 2024 and 45,354 in 2025.

Peer review at major AI conferences is straining under a submission load that has nearly doubled in two years. According to Google researchers publishing on arXiv, combined submissions to ICLR, ICML, and NeurIPS grew from 23,838 in 2023 to 32,628 in 2024 to 45,354 in 2025, and the paper projects 73,883 in 2026. A Google team led by Rajesh Jayaram, and including Corinna Cortes, Yossi Matias, and David Woodruff, built the Paper Assistant Tool (PAT) to address this directly.

PAT is not a replacement reviewer. The paper proposes a four-level taxonomy of AI involvement in peer review and explicitly positions PAT at Level 1: a tool authors run on their own manuscripts before submission. The system segments papers into sections, allocates more compute to proofs than to introductions, and synthesizes a comprehensive report. The paper notes that verifying complex mathematical claims requires generating a large number of thinking tokens, which can easily exceed a single model's context capacity, hence the agentic multi-step architecture. PAT was piloted at STOC in November 2025 and ICML in January 2026, with over 4,700 submissions reviewed across both venues.
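The idea of giving proofs more compute than introductions can be illustrated with a toy budget allocator. This is only a sketch of the general pattern, not PAT's implementation: the paper as summarized here does not publish weights or interfaces, so the section kinds, the weights, and the names `allocate_budget`, `review_section`, and `review_paper` are all hypothetical, and the model call is replaced by a stub.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration. The source says proofs
# receive more compute than introductions but gives no numbers.
SECTION_WEIGHTS = {"proof": 5.0, "method": 2.0, "introduction": 1.0}
DEFAULT_WEIGHT = 1.5  # assumed fallback for section kinds not listed above


@dataclass
class Section:
    title: str
    kind: str
    text: str


def allocate_budget(sections, total_tokens):
    """Split a thinking-token budget across sections in proportion to weight."""
    weights = [SECTION_WEIGHTS.get(s.kind, DEFAULT_WEIGHT) for s in sections]
    total_weight = sum(weights)
    return {
        s.title: int(total_tokens * w / total_weight)
        for s, w in zip(sections, weights)
    }


def review_section(section, budget):
    """Stub standing in for a per-section model call (hypothetical)."""
    return f"[{section.title}] reviewed with a budget of {budget} tokens"


def review_paper(sections, total_tokens):
    """Review each section separately, then join the findings into one report."""
    budgets = allocate_budget(sections, total_tokens)
    findings = [review_section(s, budgets[s.title]) for s in sections]
    return "\n".join(findings)


paper = [
    Section("1 Introduction", "introduction", "..."),
    Section("2 Method", "method", "..."),
    Section("3 Proof of Theorem 1", "proof", "..."),
]
budgets = allocate_budget(paper, 80_000)
report = review_paper(paper, 80_000)
print(report)
```

Because each section is reviewed in its own call, no single call has to hold the reasoning for the whole manuscript, which is the motivation the paper gives for a multi-step design.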

The headline result is on mathematical error detection. Against the SPOT benchmark, which compiles manuscripts containing verified mistakes that led to subsequent errata or retractions, zero-shot Gemini 3.1 Pro reached 55.2% accuracy; PAT reached 89.7%, a gain of 34.5 percentage points. Author feedback was broadly positive: 97% of STOC respondents and 92.1% of ICML respondents said they would use PAT again, and 31% of ICML respondents reported running new experiments based on the tool's feedback.
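The size of the SPOT gain depends on whether it is read in percentage points or as a relative change. A quick check using only the two accuracy figures quoted above (the variable names are illustrative):

```python
baseline_accuracy = 55.2  # zero-shot model on SPOT, in percent
pat_accuracy = 89.7       # PAT on SPOT, in percent

# Absolute difference, measured in percentage points.
point_gain = round(pat_accuracy - baseline_accuracy, 1)

# Relative improvement over the baseline, measured in percent.
relative_gain = round((pat_accuracy - baseline_accuracy) / baseline_accuracy * 100, 1)

print(f"{point_gain} percentage points, {relative_gain}% relative improvement")
```

The difference between the two accuracies is 34.5 percentage points, which corresponds to a relative improvement of 62.5% over the baseline.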

The paper is candid about what can go wrong. The authors flag the risk of falsely declaring a proof or argument incorrect because of reasoning failures or model misunderstandings, and they warn of cognitive complacency, specifically the deskilling of human reviewers who may reduce their scrutiny once AI has pre-screened a paper. They also name the risk of authors adversarially gaming review agents once evaluation criteria become known, and cite algorithmic biases as a structural concern. One caveat is that author satisfaction surveys measure experience, not accuracy, and the paper does not report how often PAT incorrectly flags valid proofs.

What the paper does not provide is outcome data: whether manuscripts that went through PAT actually fared better in human review. That question matters for understanding whether the benchmark lift translates to real scientific quality gains. For now, the most direct beneficiaries are authors who previously had no structured pre-submission feedback at all, a group that skews toward early-career researchers without access to strong mentors.