marktechpost.com web signal

Google's Gemini-SQL2 Tops BIRD Text-to-SQL at 80.04%

google enterprise ai generative ai enterprise-ai benchmark text-to-sql

Key insights

  • Gemini-SQL2 scored 80.04% execution accuracy on the BIRD single-model leaderboard, built on Gemini 3.1 Pro.
  • Human performance on BIRD stands at 92.96%, 12.92 points above Gemini-SQL2's score.
  • No API, model card, or technical report for Gemini-SQL2 has been publicly released as of June 12, 2026.

Why this matters

Google holding both top spots on BIRD's single-model track with Gemini-SQL2 (80.04%) and the original Gemini-SQL (approximately 77.2%) establishes a benchmark lead over AWS's Q-SQL and Anthropic's Claude Opus 4.6 at a moment when enterprise AI spending is heavily weighted toward data analytics. The 12.92-point gap to human performance (92.96%) signals that text-to-SQL is approaching deployment thresholds where the competitive question shifts from 'can AI write SQL' to 'which vendor's AI writes it best'. Google has released no technical report or API, meaning the benchmark claim cannot be independently audited or reproduced by competitors or enterprise buyers.

Summary

Google Research announced Gemini-SQL2 on June 12, 2026. Built on Gemini 3.1 Pro, it scored 80.04% execution accuracy on the BIRD single-model text-to-SQL leaderboard. BIRD covers 12,751 question-SQL pairs across 95 databases in 37 professional domains and tests whether generated SQL actually runs and returns correct results. Human performance stands at 92.96%, 12.92 points above Gemini-SQL2.

Google, AWS, and Anthropic are competing on enterprise text-to-SQL, with Google now holding the top two named positions on the BIRD single-model track:

  • Google's prior Gemini-SQL sits at approximately 77.2%; AWS's Q-SQL follows at roughly 76.5%.
  • Anthropic's Claude Opus 4.6 scores approximately 70.1% on the same benchmark.
  • No API, model card, or technical report has been released; potential integration targets include BigQuery Studio, AlloyDB AI, and Cloud SQL Studio.

At 80% accuracy, automated SQL generation is credible for accelerating data workflows, but not yet reliable enough to remove human review from production pipelines.
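To make "execution accuracy" concrete, here is a minimal sketch of the idea: run the generated query and the reference query against the same database and count the pair as correct only when both return the same rows. This is not BIRD's official evaluation script. The toy `orders` table, the example query pairs, the helper names, and the choice to compare rows as an unordered multiset are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails to run."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result rows match."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        # A prediction that errors out never counts as correct.
        if predicted is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database standing in for one benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'EU', 120.0), (2, 'US', 80.0), (3, 'EU', 40.0);
""")

# Hypothetical (predicted, gold) pairs: two matches, one wrong result, one runtime error.
pairs = [
    ("SELECT COUNT(*) FROM orders WHERE region = 'EU'",
     "SELECT COUNT(id) FROM orders WHERE region = 'EU'"),
    ("SELECT SUM(amount) FROM orders",
     "SELECT SUM(amount) FROM orders WHERE region = 'EU'"),
    ("SELECT total FROM orders",
     "SELECT amount FROM orders"),
    ("SELECT id FROM orders WHERE amount > 50 ORDER BY id",
     "SELECT id FROM orders WHERE amount >= 80"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

The point of the metric is that differently written SQL can still be scored as correct, as in the first and last pairs, while a query that runs but returns the wrong rows is scored the same as one that fails outright.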

Potential risks and opportunities

Risks

  • Enterprises deploying Gemini-SQL2 against production BigQuery pipelines face a baseline failure rate: if the 80.04% benchmark accuracy carries over, roughly 1 in 5 generated queries would fail or return incorrect results and need human correction.
  • AWS (Q-SQL at approximately 76.5%) and Anthropic (Claude Opus 4.6 at approximately 70.1%) trail Gemini-SQL2 on BIRD by roughly 3.5 and 9.9 points respectively, risking procurement losses in enterprise text-to-SQL evaluations.
  • Without a published technical report, competitors and third-party evaluators cannot audit the methodology behind Gemini-SQL2's 80.04% score, leaving Google's benchmark claim effectively unverifiable.
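The arithmetic behind these figures can be checked with a short sketch. The scores are the ones quoted in this brief (the non-Google figures are approximate), and the per-1,000-queries estimate assumes benchmark accuracy would carry over unchanged to production, which has not been established.

```python
# Scores as reported in this brief, in percent; all but Gemini-SQL2 are approximate.
scores = {
    "Gemini-SQL2": 80.04,
    "Gemini-SQL": 77.2,
    "Q-SQL": 76.5,
    "Claude Opus 4.6": 70.1,
}
human = 92.96

top = scores["Gemini-SQL2"]

# Share of benchmark questions answered incorrectly, in percentage points.
error_rate = round(100 - top, 2)

# Distance from the top model to human performance.
gap_to_human = round(human - top, 2)

# Lead of Gemini-SQL2 over each other entry.
leads = {
    name: round(top - score, 2)
    for name, score in scores.items()
    if name != "Gemini-SQL2"
}

# Assumption: production accuracy equals benchmark accuracy.
expected_failures_per_1000 = round(1000 * error_rate / 100)

print(f"error rate: {error_rate} points")
print(f"gap to human: {gap_to_human} points")
print(f"leads: {leads}")
print(f"expected failures per 1,000 queries: {expected_failures_per_1000}")
```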

Opportunities

  • Google Cloud is positioned to accelerate adoption of BigQuery Studio, AlloyDB AI, and Cloud SQL Studio if Gemini-SQL2 integrates into those products, creating a measurable moat in enterprise data analytics.
  • SaaS vendors and enterprise data teams building embedded natural-language query interfaces gain a credible performance reference point once Gemini-SQL2 becomes API-accessible.
  • AWS and Anthropic face clear incentive to publish competing BIRD results or updated benchmark scores for Q-SQL and Claude-based SQL pipelines before Google's integration advantage solidifies.

What we don't know yet

  • No API, model card, or technical report released; timeline for public access or integration into BigQuery Studio and AlloyDB AI is entirely unspecified.
  • What architectural changes over the original Gemini-SQL drove the jump to 80.04%, given no technical paper has been published to explain the improvement.
  • Whether the BIRD single-model score of 80.04% across 95 curated benchmark databases translates to comparable accuracy on real enterprise production databases.