Qwen3-35B sub-agent silently fails in production
Key insights
- Qwen3-35B-A3B sub-agent failures present as plausible tool-call outputs, making errors invisible to orchestrators expecting explicit failure signals.
- JSON outputs from the model misroute to its internal reasoning channel, causing silent data loss in multi-agent pipelines.
- Community testing across multiple orchestration frameworks confirmed the failure patterns, suggesting the issue is structural rather than model-specific.
Why this matters
Multi-agent pipelines are rapidly becoming the default production architecture for AI systems, and silent failure propagation is a correctness problem that standard logging and completion-rate metrics won't surface. The four failure modes documented here mean an orchestrator can report 100% task completion while downstream outputs are systematically corrupted across the entire run. Any team deploying open-source models as sub-agents without per-call output validation is currently operating without a meaningful signal that their pipeline is working.
Summary
Running Qwen3-35B-A3B as a sub-agent on a single RTX 4090, a developer has documented four failure modes the orchestrator layer never sees.
In solo use, model failures are obvious. In a pipeline, the model wraps those failures in plausible tool-call responses. The orchestrator logs success and continues. Corruption propagates silently.
Essentially, neither Alibaba's Qwen team nor multi-agent pipeline builders designed for how failure semantics change when one model controls another. The documented failure modes include:
- JSON outputs misroute to the model's internal reasoning channel instead of the structured response field.
- Context bleeds silently across sequential tasks, contaminating later instruction calls.
- Hallucinated completions are repackaged as successful results rather than flagged errors.
Multiple community replies confirmed the patterns across different orchestration frameworks, pointing to a structural gap beyond any single model.
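A per-call check of the kind the failure modes above call for might look like the following minimal sketch. It assumes the sub-agent message arrives as a dictionary with a `content` field and a separate `reasoning_content` field, as some OpenAI-compatible inference servers return it. Those field names, the `check_subagent_message` helper, and the required-key convention are illustrative assumptions, not details from the original report.

```python
import json
from typing import Any, Optional, Set, Tuple


def extract_json(text: Optional[str]) -> Optional[Any]:
    """Return the parsed value if text is a JSON object or array, else None."""
    if not text:
        return None
    try:
        value = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, (dict, list)) else None


def check_subagent_message(message: dict, required_keys: Set[str]) -> Tuple[str, Any]:
    """Classify one sub-agent message as 'ok', 'misrouted', or 'invalid'.

    Assumed (hypothetical) message shape:
    {"content": "...", "reasoning_content": "..."}
    """
    payload = extract_json(message.get("content"))
    if payload is None:
        # No JSON where the orchestrator expects it. If JSON shows up in the
        # reasoning field instead, report it as misrouted rather than empty.
        if extract_json(message.get("reasoning_content")) is not None:
            return "misrouted", None
        return "invalid", None
    # A well-formed object that lacks the expected keys is still a failure.
    if not isinstance(payload, dict) or not required_keys.issubset(payload.keys()):
        return "invalid", None
    return "ok", payload
```

An orchestrator could call this on every sub-agent response and treat anything other than `"ok"` as an explicit failure instead of logging success. It checks structure only, so a hallucinated result with the right keys would still need a task-specific check.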
Potential risks and opportunities
Risks
- Production teams using Qwen3-35B-A3B in automated pipelines without per-call output validation may have already accumulated silently corrupted results that passed downstream quality checks undetected.
- Orchestration framework maintainers (LangGraph, CrewAI, AutoGen) face near-term pressure to ship sub-agent failure detection layers, adding latency and engineering cost to existing deployments.
- Organizations treating high orchestrator-level task-completion rates as a pipeline health signal may misread system correctness for months before corrupted outputs surface in user-facing products.
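The gap between completion rate and correctness can be made concrete with a small sketch. The `CallRecord` type and `pipeline_health` function are hypothetical names, and the sketch assumes each call has already been judged by some independent validator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    completed: bool  # orchestrator received a well-formed response and logged success
    validated: bool  # an independent check confirmed the output was actually correct


def pipeline_health(records: List[CallRecord]) -> Dict[str, float]:
    """Report completion rate alongside the rate of independently validated calls."""
    total = len(records)
    if total == 0:
        return {"completion_rate": 0.0, "validated_rate": 0.0, "silent_failure_rate": 0.0}
    completed = sum(r.completed for r in records)
    validated = sum(r.completed and r.validated for r in records)
    return {
        "completion_rate": completed / total,
        "validated_rate": validated / total,
        # Calls the orchestrator counted as successes that failed validation.
        "silent_failure_rate": (completed - validated) / total,
    }
```

With four calls that all complete but only one that validates, this reports a completion rate of 1.0 and a silent failure rate of 0.75, which is the pattern a completion-only dashboard would hide.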
Opportunities
- Inference infrastructure providers (Together AI, Fireworks AI, Replicate) can differentiate by adding structured-output enforcement and sub-agent output validation at the API layer before orchestrators ever see a response.
- Observability vendors building for multi-agent systems (Langfuse, Arize AI, Weights & Biases) have a clear wedge: per-call semantic validation that catches the failure modes orchestrators currently miss entirely.
- Open-source contributors who ship a standardized sub-agent failure testing harness now could establish it as the default evaluation layer for multi-agent pipelines ahead of any framework-native solution.
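One piece of such a testing harness could be a canary probe for context bleed between sequential tasks. The sketch below is an illustration under stated assumptions, not a method from the original report: `run_task` is a hypothetical callable that sends one prompt to the sub-agent and returns its reply as a string, and the probe only catches leaks that surface verbatim in the output.

```python
import uuid
from typing import Callable, List


def detect_context_bleed(run_task: Callable[[str], str], n_pairs: int = 3) -> List[str]:
    """Run pairs of tasks and return the canary tokens that leaked into the second task.

    Each pair plants a unique token via an instruction scoped to task A, then
    runs an unrelated task B. If the token appears in B's reply, the scoped
    instruction has carried over into a later call.
    """
    leaked = []
    for _ in range(n_pairs):
        canary = f"CANARY-{uuid.uuid4().hex[:8]}"
        # Task A: an instruction that should apply to this call only.
        run_task(f"For this task only, end your reply with the token {canary}.")
        # Task B: unrelated; its reply should never contain the canary.
        reply = run_task("Reply with the single word READY.")
        if canary in reply:
            leaked.append(canary)
    return leaked
```

A non-empty result indicates that state from one task reached the next. An empty result is weaker evidence, since bleed can alter behavior without echoing the token.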
What we don't know yet
- Whether Alibaba's Qwen team has reproduced the instruction-scope leakage behavior internally and whether it persists across newer Qwen3 variants released in early 2026.
- Which specific orchestration frameworks (LangGraph, CrewAI, AutoGen) were tested and whether any showed meaningfully better sub-agent failure isolation than others.
- Whether structured-output enforcement modes at the inference layer (JSON schema constraints, grammar-constrained decoding) mitigate the JSON misrouting failure mode.
Originally reported by reddit.com
Original headline: r/LocalLLaMA: Developer Running Qwen3.6-35B-A3B as Sub-Agent Documents Four Failure Modes That Are Invisible to the Orchestrator