reddit.com via Reddit

Qwen3-35B sub-agent silently fails in production

agents inference open source local-llm failure-modes

Key insights

  • Qwen3-35B-A3B sub-agent failures present as plausible tool-call outputs, making errors invisible to orchestrators expecting explicit failure signals.
  • JSON outputs from the model misroute to its internal reasoning channel, causing silent data loss in multi-agent pipelines.
  • Community testing across multiple orchestration frameworks confirmed the failure patterns, suggesting the issue is structural rather than model-specific.

Why this matters

Multi-agent pipelines are rapidly becoming the default production architecture for AI systems, and silent failure propagation is a correctness problem that standard logging and completion-rate metrics won't surface. The four failure modes documented here mean an orchestrator can report 100% task completion while downstream outputs are systematically corrupted across the entire run. Any team deploying open-source models as sub-agents without per-call output validation is currently operating without a meaningful signal that their pipeline is working.
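The gap between a completion-rate metric and actual correctness can be made concrete with a small sketch. Everything here is illustrative and not taken from the original post: `CallRecord`, `completion_rate`, `validated_rate`, and the non-empty-output check are hypothetical names and choices that stand in for whatever a real pipeline logs and validates.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CallRecord:
    """One sub-agent call as the orchestrator logged it (hypothetical shape)."""
    status: str  # what the orchestrator recorded, e.g. "success"
    output: str  # the raw text the sub-agent returned


def completion_rate(records: Sequence[CallRecord]) -> float:
    """Fraction of calls the orchestrator logged as successful."""
    if not records:
        raise ValueError("no records to score")
    return sum(r.status == "success" for r in records) / len(records)


def validated_rate(
    records: Sequence[CallRecord], is_valid: Callable[[str], bool]
) -> float:
    """Fraction of calls logged as successful that also pass an independent check."""
    if not records:
        raise ValueError("no records to score")
    passed = sum(r.status == "success" and is_valid(r.output) for r in records)
    return passed / len(records)


if __name__ == "__main__":
    # Four calls all logged as "success"; two of them carry no usable output.
    records = [
        CallRecord("success", '{"answer": 1}'),
        CallRecord("success", ""),
        CallRecord("success", '{"answer": 2}'),
        CallRecord("success", "   "),
    ]
    non_empty = lambda text: bool(text.strip())  # deliberately weak placeholder check
    print(completion_rate(records))            # 1.0
    print(validated_rate(records, non_empty))  # 0.5
```

Tracking both numbers side by side is the point of the sketch: a widening gap between them is the kind of signal that a completion-rate dashboard alone would not show.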

Summary

Running Qwen3-35B-A3B as a sub-agent on a single RTX 4090, a developer has documented four failure modes that the orchestrator layer never sees. In solo use, model failures are obvious. In a pipeline, the model wraps those failures in plausible tool-call responses, so the orchestrator logs success and continues while corruption propagates silently. Essentially, neither side (Alibaba's Qwen team, multi-agent pipeline builders) designed for how failure semantics change when one model controls another. The failure modes described include:

  • JSON outputs misroute to the model's internal reasoning channel instead of the structured response field.
  • Context bleeds silently across sequential tasks, contaminating later instruction calls.
  • Hallucinated completions are repackaged as successful results rather than flagged errors.

Multiple community replies confirmed the patterns across different orchestration frameworks, pointing to a structural gap beyond any single model.
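A per-call check for the JSON misrouting case might look like the sketch below. It assumes the sub-agent's reply arrives as a dictionary with a `content` field for the structured answer and a `reasoning_content` field for the model's reasoning text. Those field names are an assumption about the serving layer, not something the post specifies, and `SubAgentOutputError` and `extract_structured_output` are hypothetical helpers.

```python
import json
from typing import Any


class SubAgentOutputError(Exception):
    """Raised when a sub-agent reply cannot be trusted as a structured result."""


def extract_structured_output(
    message: dict[str, Any], required_keys: set[str]
) -> dict[str, Any]:
    """Parse the structured payload from a reply, failing loudly instead of silently."""
    # Field names below are assumed; adjust them to what the inference server returns.
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if not content:
        if "{" in reasoning:
            # Heuristic only: JSON-like text in the reasoning field with an empty
            # response field suggests the payload went to the wrong channel.
            raise SubAgentOutputError(
                "content is empty but the reasoning field holds JSON-like text"
            )
        raise SubAgentOutputError("content is empty")

    try:
        payload = json.loads(content)
    except json.JSONDecodeError as exc:
        raise SubAgentOutputError(f"content is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise SubAgentOutputError("expected a JSON object at the top level")

    missing = required_keys - set(payload)
    if missing:
        raise SubAgentOutputError(f"missing required keys: {sorted(missing)}")
    return payload
```

Raising an exception rather than returning a default forces the orchestrator to record a failure, which is the explicit signal the post says is missing. The check does not address hallucinated completions, since a well-formed payload can still be wrong.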

Potential risks and opportunities

Risks

  • Production teams using Qwen3-35B-A3B in automated pipelines without per-call output validation may have already accumulated silently corrupted results that passed downstream quality checks undetected
  • Orchestration framework maintainers (LangGraph, CrewAI, AutoGen) face near-term pressure to ship sub-agent failure detection layers, adding latency and engineering cost to existing deployments
  • Organizations treating high orchestrator-level task-completion rates as a pipeline health signal may misread system correctness for months before corrupted outputs surface in user-facing products

Opportunities

  • Inference infrastructure providers (Together AI, Fireworks AI, Replicate) can differentiate by adding structured-output enforcement and sub-agent output validation at the API layer before orchestrators ever see a response
  • Observability vendors building for multi-agent systems (Langfuse, Arize AI, Weights & Biases) have a clear wedge: per-call semantic validation that catches the failure modes orchestrators currently miss entirely
  • Open-source contributors who ship a standardized sub-agent failure testing harness now could establish it as the default evaluation layer for multi-agent pipelines ahead of any framework-native solution
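As a rough illustration of what one probe in such a testing harness could look like, the sketch below plants a unique canary string in one task and checks whether it resurfaces in the output of the next. It is an assumption-laden example and not the method used in the original post: `check_context_bleed` and both stub agents are hypothetical, and `call_agent` stands in for whatever function sends a prompt to the sub-agent.

```python
import uuid
from typing import Callable


def check_context_bleed(
    call_agent: Callable[[str], str], first_task: str, second_task: str
) -> bool:
    """Return True if a canary planted in the first task appears in the second output."""
    canary = f"CANARY-{uuid.uuid4().hex[:12]}"
    # The first call carries the canary; its output is not inspected.
    call_agent(f"{first_task}\n\nInternal reference (do not repeat): {canary}")
    second_output = call_agent(second_task)
    return canary in second_output


class LeakyStubAgent:
    """Stub that keeps every prompt and echoes its whole history (simulates bleed)."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def __call__(self, prompt: str) -> str:
        self.history.append(prompt)
        return " | ".join(self.history)


def isolated_stub_agent(prompt: str) -> str:
    """Stub that only ever sees the current prompt (simulates clean isolation)."""
    return f"done: {prompt}"


if __name__ == "__main__":
    print(check_context_bleed(LeakyStubAgent(), "Summarize file A", "Summarize file B"))
    print(check_context_bleed(isolated_stub_agent, "Summarize file A", "Summarize file B"))
```

A probe like this only catches verbatim leakage. Paraphrased or partial carry-over between tasks would need a semantic comparison, so a passing result is weak evidence of isolation at best.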

What we don't know yet

  • Whether Alibaba's Qwen team has reproduced the instruction-scope leakage behavior internally and whether it persists across newer Qwen3 variants released in early 2026
  • Which specific orchestration frameworks (LangGraph, CrewAI, AutoGen) were tested and whether any showed meaningfully better sub-agent failure isolation than others
  • Whether structured-output enforcement modes at the inference layer (JSON schema constraints, grammar-constrained decoding) mitigate the JSON misrouting failure mode