Agent Fail Museum catalogs AI agent failure patterns
Key insights
- Agent Fail Museum documents AI agent failure patterns that reappear after code changes and model upgrades, rather than isolated one-off bugs.
- Cataloged failure modes include context window overflows causing silent hallucinations and stale state triggering incorrect tool selection.
- The archive uses reproducible examples to surface cross-team failure patterns typically siloed within individual engineering organizations.
Why this matters
AI agent failure patterns such as silent hallucinations from context overflow and wrong tool selection from stale state recur across organizations that have no shared vocabulary or database for comparing them. A public failure taxonomy changes the economics of debugging: teams can match symptoms to documented patterns rather than re-deriving root causes after every model upgrade or deployment change. If the archive gains adoption, it could become infrastructure for AI agent reliability engineering, shaping how teams design observability, test suites, and upgrade runbooks.
Summary
A developer community launched agent-fail-museum.vercel.app, a public archive of AI agent failure patterns that recur across code changes, model upgrades, and deployments.
The project documents systemic failure modes with reproducible examples meant to outlast individual post-mortems. Three categories dominate the early catalog: context window overflows, stale state, and overinterpretation of loosely worded instructions.
In short, according to community developers posting on Reddit, this is the first public cross-org failure taxonomy for AI agents.
- Context window overflows cause silent hallucinations rather than surfaced errors, bypassing standard alerting.
- Stale state triggers wrong tool selection, and the pattern persists through model upgrades.
- Loose instructions are overinterpreted in consistent, now-cataloged ways across teams.
Agent failure knowledge has been org-local by default; this project is a structural attempt to change that.
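The first pattern above, context overflow that produces silent hallucinations instead of an error, can be illustrated with a small guard that fails loudly before the prompt is sent. This is a minimal sketch and is not taken from the archive: `ContextBudget`, `ContextOverflowError`, the token limits, and the word-count token estimate are all assumptions made for illustration. A real system would use the model's own tokenizer and limits.

```python
from dataclasses import dataclass


class ContextOverflowError(RuntimeError):
    """Raised when a prompt would exceed the configured context budget."""


@dataclass
class ContextBudget:
    max_tokens: int          # hypothetical limit; real limits depend on the model
    reserved_for_reply: int  # tokens kept free for the model's answer

    def estimate_tokens(self, text: str) -> int:
        # Crude assumption: one token per whitespace-separated word.
        # A real system would call the model's tokenizer instead.
        return len(text.split())

    def check(self, messages: list[str]) -> int:
        """Return the estimated prompt size, or raise if it does not fit."""
        used = sum(self.estimate_tokens(m) for m in messages)
        available = self.max_tokens - self.reserved_for_reply
        if used > available:
            # Surface the overflow as an error so alerting can see it,
            # rather than truncating the prompt silently.
            raise ContextOverflowError(
                f"prompt needs ~{used} tokens, only {available} available"
            )
        return used
```

Calling `ContextBudget(max_tokens=10, reserved_for_reply=2).check(messages)` before each model call turns an overflow into an exception that standard alerting can catch. The design choice here is to fail rather than truncate, on the assumption that a visible error is cheaper to debug than a plausible but wrong answer.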
Potential risks and opportunities
Risks
- Teams that treat the archive as a complete checklist may skip root-cause analysis for failure modes not yet cataloged, creating false confidence in agent reliability assessments.
- Community-curated patterns without validation standards could propagate incorrect root-cause attributions across engineering orgs, hardening flawed mental models at scale.
- If the Vercel-hosted project goes unmaintained or the domain lapses, the archive becomes a dead reference cited in post-mortems with no canonical source or migration path.
Opportunities
- AI agent observability vendors (Langfuse, Arize AI, Honeycomb) could align product roadmaps to cataloged failure modes, gaining credibility with engineering teams already referencing the archive.
- Framework maintainers (LangChain, LlamaIndex, CrewAI) could reference documented patterns to justify architectural decisions in changelogs and roadmaps, accelerating trust with enterprise buyers.
- Enterprise AI platform teams could build internal test suites targeting known failure patterns before model upgrades, using the archive as a structured regression checklist to reduce upgrade-related incidents.
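The regression-checklist idea in the last opportunity could look roughly like the sketch below: each known failure pattern becomes a named check that is run against the agent before an upgrade. Everything here is hypothetical, including the `Agent` interface, the pattern names, the tool names, and `toy_agent`. None of it reflects the archive's actual entries or any framework's API.

```python
from typing import Callable

# Hypothetical agent interface: takes a task string and a state dict,
# and returns the name of the tool it chose.
Agent = Callable[[str, dict], str]


def check_stale_state(agent: Agent) -> bool:
    """The agent should refresh state marked stale before acting on it."""
    state = {"order_status": "shipped", "fresh": False}
    return agent("cancel order", state) == "refresh_state"


def check_fresh_state(agent: Agent) -> bool:
    """With fresh state, the agent should act on it directly."""
    state = {"order_status": "pending", "fresh": True}
    return agent("cancel order", state) == "cancel_order"


# Illustrative pattern names, not identifiers from the archive.
CHECKLIST: dict[str, Callable[[Agent], bool]] = {
    "stale-state-tool-selection": check_stale_state,
    "fresh-state-tool-selection": check_fresh_state,
}


def run_checklist(agent: Agent) -> dict[str, bool]:
    """Run every check and report pass/fail per pattern name."""
    return {name: check(agent) for name, check in CHECKLIST.items()}


def toy_agent(task: str, state: dict) -> str:
    # Stand-in for a real agent: refreshes whenever state is not fresh.
    if not state.get("fresh", False):
        return "refresh_state"
    return "cancel_order" if task == "cancel order" else "noop"
```

Running `run_checklist(toy_agent)` before and after a model upgrade gives a per-pattern pass/fail map that can be diffed, so a regression in one cataloged pattern shows up by name.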
What we don't know yet
- How many failure patterns are documented at launch, and which organizations or teams have contributed reproducible examples versus only anecdotal reports.
- Whether cataloged patterns have been validated across multiple model providers (OpenAI, Anthropic, Google Gemini) or are based primarily on a single stack.
- How the project plans to maintain curation and reproducibility standards as community submissions scale, given no formal moderation structure was announced.
Originally reported by vercel.app
Original headline: r/artificial: 'Agent Fail Museum' — Public Archive of AI Agent Failure Patterns That Persist Across Code Changes and Model Upgrades