The model-release calendar went quiet, but the distribution layer moved more this week than it did in all of April. OpenAI flipped ChatGPT's default to a model that hallucinates roughly half as often on the prompts most likely to hurt users. A Miami startup with no peer-reviewed paper claimed a 12M-token context window and got $29M the same day. Apple confirmed iOS 27 will plug Claude or Gemini into Siri system-wide. If you build on the foundation layer, the surface underneath you just shifted.
Get more from AI Weekly
More signal, less noise — pick your channels.
You're reading the weekly brief. Below are the other ways to follow the story — every channel free, easy to leave.
- → Explore 16 deep dives: weekly topic-specific newsletters on Generative AI, Machine Learning, AI in Business, Robotics, Frontier Research, Geopolitics, Healthcare, and more. Browse all 16 deep dives →
- → Breaking AI alerts: when something major breaks (a $60B acquisition, a regulator's emergency meeting, a frontier model leak), alert subscribers know within hours. Typically 0–2 emails per day. Get breaking alerts →
- → AI News Today (live): a live dashboard updated as the scanner finds news, with scored stories from the last 48 hours, weekly entity movers, and quarterly trend lines across 113 AI companies, people, and topics. Open AI News Today →
Watch & Listen First
State of AI in 2026: LLMs, Coding, Agents — Lex Fridman #490 (YouTube)
Lambert and Raschka spend four hours on where reasoning models stand and whether open-weight labs can structurally close the gap with the frontier.
AI + a16z: MCP Co-Creator on the Next Wave of LLM Innovation (Spotify)
Anthropic's David Soria Parra on MCP's origin and the next integrations — the protocol layer underneath half the agentic tooling shipping now.
Latent Space: The AI Engineer Podcast (Spotify)
swyx and Alessio on coding agents and how production engineers actually wire frontier models together.
Key Takeaways
- Defaults are the new benchmark. OpenAI cut hallucinations 52.5% on high-stakes topics by changing what ships at chatgpt.com, not by training a smarter Pro tier.
- Sub-quadratic attention is a product, not a paper. Subquadratic's 12M-token window is unverified, but it's the first commercial deployment of post-transformer attention research that's been circling arXiv for two years.
- Apple is unbundling Apple Intelligence. iOS 27 Extensions lets third-party models drive Siri, Writing Tools, and Image Playground system-wide — plug-and-play on 1B+ devices.
- RL for reasoning is being reframed, not scaled. Two papers argue RL doesn't teach new capability — it selects sparse policies the base model already contains, wiping out most of the claimed compute moat.
- Agentic Android ships in months. Gemini Intelligence builds a shopping cart from a grocery list screenshot with one confirm — a full quarter before Apple's equivalent.
The Big Story
OpenAI Flips ChatGPT's Default to GPT-5.5 Instant, Cuts Hallucinations 52.5% · May 5, 2026 · OpenAI
→ The model isn't the headline — the rollout is. 5.5 Instant produced 52.5% fewer hallucinated claims than 5.3 on high-stakes prompts (medicine, law, finance), with 37.3% fewer on user-flagged factual-error conversations. It also uses ~30% fewer words per answer, so token-cost per useful reply drops without any pricing change. It matters because it's the default: every free chatgpt.com user gets it, and developers hitting chat-latest inherit it on next deploy. RAG guardrails built against 5.3's failure modes are now over-engineered; teams pinned to 5.3 need to retest evals before users notice answers feel different.
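The deploy-time inheritance is worth making concrete. A minimal sketch of the two routing choices — pin a version or track the default alias — using the model names quoted in the story ("gpt-5.3", "chat-latest"); treat them as illustrative, not as your account's actual model list:

```python
PINNED_MODEL = "gpt-5.3"    # frozen behaviour: your evals and RAG guardrails stay valid
ALIAS_MODEL = "chat-latest" # tracks OpenAI's default: inherits 5.5 Instant silently

def pick_model(evals_passed_on_new_default: bool) -> str:
    """Route to the moving alias only after your own eval suite
    has re-certified the new default; otherwise stay pinned."""
    return ALIAS_MODEL if evals_passed_on_new_default else PINNED_MODEL
```

The design choice is the point: teams on the alias get the hallucination win for free but must retest immediately; teams pinned to 5.3 keep stable behaviour but are now running over-engineered guardrails.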
Also This Week
Subquadratic Launches With $29M, Claims 12M Context at 1,000x Lower Compute · May 5, 2026 · Subquadratic
→ SSA that learns which token pairs actually matter is real research, but the company shipped no peer-reviewed paper behind its claimed 92.1% needle-in-a-haystack recall at 12M tokens, and earlier sub-quadratic attempts (Mamba, RWKV) went hybrid when they couldn't match dense-attention quality. Benchmark before you migrate.
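For intuition on what "keeping only the token pairs that matter" means, here is a toy top-k sparse attention in NumPy. This is my illustration of the general idea, not Subquadratic's method — and note it still scores every pair first (quadratic), whereas the whole pitch of learned sparse attention is to avoid that step:

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=4):
    """Toy sparse attention: score all query-key pairs, then keep only the
    top-`keep` keys per query, renormalise, and mix the values.
    Illustrates the sparsification step only; a real sub-quadratic system
    must decide which pairs to keep without scoring all of them."""
    scores = q @ k.T / np.sqrt(q.shape[-1])               # (n_q, n_k)
    # threshold = each row's keep-th largest score
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)      # drop the rest
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `keep` equal to the full key count this reduces to ordinary dense softmax attention, which is a handy sanity check when evaluating any sparse-attention claim.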
Apple to Let iOS 27 Users Swap in Gemini or Claude for Siri · May 5, 2026 · Bloomberg
→ Extensions wires installed AI apps directly into Siri, Writing Tools, and Image Playground with distinct voices per provider — iOS 27 becomes a model-routing OS, and OpenAI's exclusive Apple slot dies the day Anthropic and Google ship.
Google Ships Agentic Gemini Intelligence Across Android Ahead of I/O · May 12, 2026 · Google Blog
→ Long-press the power button over a grocery list, ask for a cart in your shopping app, confirm checkout — the first agentic phone behaviour shipping to the actual Samsung and Pixel install base.
Cursor Composer 2 Drops to $0.50/$2.50 per Million Tokens · May 2026 · Cursor
→ Frontier-tier coding priced 5x below GPT-5.4 with sub-30-second turns and 8 parallel agents in isolated worktrees — cost-per-finished-task is falling faster than headline token price.
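To see what those rates mean per task, a back-of-envelope sketch at the quoted $0.50 input / $2.50 output per million tokens. The per-task token counts are my assumptions for illustration, not figures from Cursor:

```python
IN_RATE = 0.50 / 1e6    # dollars per input token
OUT_RATE = 2.50 / 1e6   # dollars per output token

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one coding task at the quoted per-token rates."""
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# A hypothetical task reading 200k tokens of context and writing 20k:
cost = task_cost(200_000, 20_000)   # 0.10 + 0.05 = $0.15
```

At those assumed volumes a finished task lands around fifteen cents — and with 8 parallel agents in isolated worktrees, the metric that matters is exactly what the item says: cost per finished task, not headline token price.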
From the Lab
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning · May 7, 2026 · arXiv 2605.06241
→ Akgül et al. show RL post-training doesn't teach reasoning — it redistributes probability onto solutions the base model already contained. ReasonMaxxer matches full RL with tens of problems and minutes of single-GPU training, wiping out three orders of magnitude of the compute moat big RL pipelines claimed.
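The "selection, not learning" claim has a simple mechanical picture — my illustration, not the paper's code: reward-weighted reweighting moves probability mass onto answers the base model already assigns nonzero mass, and can never create mass where there was none:

```python
import numpy as np

# Hypothetical base-model distribution over four candidate answers.
base_probs = {"correct_a": 0.02, "correct_b": 0.01, "wrong_c": 0.60, "wrong_d": 0.37}
rewards    = {"correct_a": 1.0,  "correct_b": 1.0,  "wrong_c": 0.0,  "wrong_d": 0.0}

def select_policy(base, reward, beta=10.0):
    """Exponentiated-reward reweighting: post_prob ∝ base_prob * exp(beta * reward).
    Support never grows — an answer with zero base probability stays at zero,
    which is the sense in which RL 'selects' rather than 'teaches'."""
    unnorm = {a: p * np.exp(beta * reward[a]) for a, p in base.items()}
    z = sum(unnorm.values())
    return {a: w / z for a, w in unnorm.items()}

post = select_policy(base_probs, rewards)
```

After reweighting, the two correct answers go from 3% combined mass to nearly all of it — without the model ever containing a new solution, which is why a handful of problems and minutes of single-GPU training can reproduce the effect.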
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key · May 7, 2026 · arXiv 2605.06638
→ RL training compute follows a power law in reasoning depth with the exponent climbing as logical expressiveness rises — deeper reasoning gets exponentially more expensive to train, which is why today's reasoning models hit a wall around 50–60 steps.
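The shape of that wall is easy to feel with numbers. A sketch of the power law, compute ∝ depth^α — the exponent values below are made up to show the shape, not taken from the paper:

```python
def train_compute(depth: float, alpha: float, c: float = 1.0) -> float:
    """RL training compute to reach a given reasoning depth under a
    power law with exponent alpha (alpha rises with expressiveness)."""
    return c * depth ** alpha

# Doubling depth from 50 to 100 steps:
ratio_a3 = train_compute(100, 3) / train_compute(50, 3)   # 8x at alpha = 3
ratio_a5 = train_compute(100, 5) / train_compute(50, 5)   # 32x at alpha = 5
```

Under these illustrative exponents, each doubling of depth multiplies cost by 2^α — so as expressiveness pushes α up, the next 50 steps past today's 50–60-step wall get rapidly less affordable.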
Worth Reading
- State of Open Source on Hugging Face: Spring 2026 — 11M users, 2M models, Chinese labs at 41% of all downloads — open-weight gravity has structurally shifted east.
- Cloudflare: Unweight — compressing an LLM 22% without sacrificing quality — a tensor-compression deep-dive from the team quietly making frontier models cheap to serve at the edge.
- What's new in Claude Opus 4.7 — xhigh effort, task budgets, and a new tokenizer — most teams are missing the 35%-more-tokens pricing footnote.
When the model layer pauses, distribution and architecture move.