Gemini, ChatGPT leak real phone numbers from training data
Key insights
- Gemini, ChatGPT, and Claude have all been documented surfacing real private phone numbers sourced directly from training data.
- Retrieval takes less effort than going through data brokers or search engines, and requires no specialized tools or adversarial prompting.
- Google confirmed an investigation but published only a support document; affected users report no substantive response after weeks.
Why this matters
Any model trained on web-scale data is a potential PII retrieval system, meaning this is a property of the training paradigm itself rather than a bug fixable with a targeted patch. For founders building on top of foundation models, this creates downstream liability exposure if their products facilitate the surfacing of personal contact information, even unintentionally. For technical leaders evaluating AI-as-search deployments, the shift from chatbot-as-novelty to chatbot-as-default-query-interface dramatically scales the exposure, because user volume becomes the attack multiplier.
Summary
Gemini, ChatGPT, and Claude are surfacing real, private phone numbers belonging to individuals who never consented to being findable through an AI interface. MIT Technology Review documented specific cases where chatbots returned accurate personal contact information sourced from model training data, with researchers noting that this takes far less effort than using data brokers or conventional search engines.
The mechanism is straightforward: PII absorbed during training can be reproduced verbatim when users prompt models in the right way. This isn't a novel attack vector requiring adversarial inputs; ordinary queries are enough. Google confirmed it is investigating flagged cases and published a support document for correction requests, but affected individuals report weeks of silence after submitting them.
In effect, Google, OpenAI, and Anthropic have built retrieval systems that can function as low-friction people-finders.
- Researchers frame this as a structural privacy risk distinct from data breaches: the data was absorbed, not stolen, and correction is opt-out by design.
- Google's public response so far is a support document with no confirmed remediation timeline for flagged numbers.
- As chatbots displace search engines as default query interfaces, the exposure surface grows with adoption, not with any discrete incident.
The core problem is that training-data PII leakage has no patch cycle and no established breach notification requirement, so affected individuals have no legal trigger informing them that they are exposed.
Potential risks and opportunities
Risks
- Google could face regulatory scrutiny in the EU under GDPR Article 17 if individuals who submit erasure requests for surfaced phone numbers receive no substantive response within the one-month window set by Article 12(3).
- OpenAI and Anthropic could be named in class-action suits by individuals whose private numbers are reproducible via ChatGPT or Claude, particularly if plaintiffs can show foreseeability given prior published research on training-data memorization.
- Enterprises deploying chatbots as internal or customer-facing search tools inherit the PII leakage risk from underlying foundation models, exposing them to state-level privacy enforcement even when the AI vendor absorbs no direct liability.
Opportunities
- PII detection and redaction vendors (Nightfall AI, Private AI, Gretel) are positioned to market training-data auditing and memorization-detection tooling directly to foundation model developers facing regulatory pressure.
- Privacy law firms specializing in GDPR and CCPA enforcement gain a new docket category: training-data PII retrieval claims that don't map cleanly to existing breach frameworks, creating a first-mover advantage for practices that develop the playbook now.
- Startups building opt-out infrastructure or model-level unlearning capabilities (a known research gap) become acquisition targets for Google, OpenAI, and Anthropic if regulatory timelines tighten around demonstrable PII removal.
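
To make the redaction tooling mentioned above more concrete, here is a minimal sketch of an output-side filter that masks phone-number-like strings in a chatbot response before it reaches the user. It is an illustration under stated assumptions, not how any vendor or model provider named here actually works: the names `PHONE_PATTERN` and `redact_phone_numbers` are hypothetical, and the pattern only covers North American formats.

```python
import re

# Assumption: a deliberately narrow pattern for North American style numbers,
# e.g. "212-555-0123", "(212) 555-0123", "+1 212 555 0123".
PHONE_PATTERN = re.compile(
    r"(?<!\d)"               # not preceded by another digit
    r"(?:\+?1[\s.-]?)?"      # optional country code
    r"(?:\(\d{3}\)|\d{3})"   # area code, with or without parentheses
    r"[\s.-]?"               # optional separator
    r"\d{3}"                 # exchange
    r"[\s.-]?"               # optional separator
    r"\d{4}"                 # line number
    r"(?!\d)"                # not followed by another digit
)


def redact_phone_numbers(
    text: str, placeholder: str = "[REDACTED PHONE]"
) -> tuple[str, int]:
    """Return the text with phone-like spans masked, plus how many were masked."""
    return PHONE_PATTERN.subn(placeholder, text)


if __name__ == "__main__":
    # Hypothetical model response containing a fictional number.
    response = "You can reach her at (212) 555-0123 after 5pm."
    cleaned, hits = redact_phone_numbers(response)
    print(cleaned, hits)
```

A filter like this only acts on what the model emits; it does not remove anything the model has memorized, and international formats would need a broader pattern or a dedicated phone-number parsing library.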
What we don't know yet
- Whether, as of May 2026, OpenAI and Anthropic have received formal correction requests for surfaced phone numbers, and what their response timelines look like.
- What volume of distinct individuals' phone numbers researchers found reproducible across models, and whether certain data source categories (public directories, scraped social profiles) are disproportionately represented.
- Whether existing privacy regulations (GDPR, CCPA) impose any breach-equivalent notification obligation when training-data PII is demonstrably retrievable on demand.
Originally reported by technologyreview.com
Original headline: MIT Technology Review: AI Chatbots Including Gemini Are Surfacing People's Real Phone Numbers From Training Data