Wayback Machine becomes collateral damage in AI-publisher war
TL;DR
- The Internet Archive's Wayback Machine is being blocked by major news sites whose target is AI scrapers, not the archive itself.
- Originality AI found 23 major news sites block ia_archiverbot; a broader count flagged 241 sites across nine countries, 87% owned by USA Today Co.
- The New York Times added archive.org_bot to robots.txt in late 2025; Reddit announced its block in August 2025.
There is a quiet, structural casualty in the fight between AI companies and news publishers, and it is the public record itself. According to The Independent, news sites and platforms including Reddit are increasingly blocking the crawlers the Internet Archive uses to snapshot the web, because the same bots that preserve a page for posterity are, from a publisher's seat, indistinguishable from the bots that hoover up content to train large language models.
The numbers give the trend a concrete shape. An analysis from AI-detection startup Originality AI counted 23 major news sites currently blocking ia_archiverbot, the crawler the Wayback Machine commonly uses. A broader tally found 241 news sites across nine countries that disallow at least one of four Internet Archive bots, 87% of them owned by USA Today Co. The New York Times added archive.org_bot to its robots.txt at the end of 2025; Reddit announced its own block in August 2025.
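For readers who want to see what such a block looks like mechanically, here is a minimal sketch of how a tally like this could be reproduced for a single site, using Python's standard urllib.robotparser. The sample robots.txt text, the helper name blocked_agents and the example.com URL are illustrative assumptions, not taken from any publisher's actual file or from Originality AI's method; the two bot names are the ones cited in the reporting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only. It is NOT copied
# from any real publisher; it just shows the shape of a per-bot block.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents from `agents` that may not fetch `url`.

    `robots_txt` is the raw text of a robots.txt file. The default URL
    is a placeholder; only its path ("/") matters to the rule matching.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Bot names as cited in the article, plus a made-up control agent.
    agents = ["archive.org_bot", "ia_archiverbot", "SomeBrowser"]
    print(blocked_agents(SAMPLE_ROBOTS_TXT, agents))
```

Run against the sample file, the archive crawlers come back as blocked while the control agent does not. A real survey would fetch each site's live robots.txt rather than a pasted string, and robots.txt is only a request that crawlers are expected to honor, not an enforcement mechanism.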
Asked about the blocks, publishers say the Archive itself is not the target. A USA Today Co. spokesperson, Lark-Marie Anton, framed the move as "not about specifically blocking the Internet Archive" but a wider response to "all scraping bots." A New York Times spokesperson, Graham James, was more direct: "The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law." Mark Graham, the Wayback Machine's director, calls his project "collateral damage caught up in the conflict between AI companies and publishers."
What is worth holding onto is what the Wayback Machine actually does in practice. It is the tool journalists, researchers and Wikipedia editors reach for when a story is silently amended, a quote is removed, or a page disappears entirely. The honest caveat is that the reporting does not quantify how much of the historical record has already been pulled offline, and the publishers' copyright complaint against AI scrapers is itself unresolved. But the structural problem the reporting describes is real: a piece of public infrastructure built to keep the web honest is being throttled as a side effect of a fight it is not party to.
If there is an upside, it is that the situation is now visible enough to act on. A technically distinct, auditable archive crawler that AI companies agree not to scrape from would solve most of this, and the people most likely to push for it are exactly the historians, lawyers and reporters who already rely on the Wayback Machine to do their work.
Shared on Bluesky by 2 AI experts
- AI has started blocking the Wayback Machine from chronicling the web. Considering that nearly 40% of websites disappear after 10 years, this will enforce a giant loss in collective memory.
Originally reported by the-independent.com
Original headline: AI’s biggest casualty could be history itself