Open-Weight LLMs Drop Tool Calls Silently Under JSON Schema

TL;DR

  • When JSON Schema and tool-calling constraints are both active, the tool invocation rate of every tested open-weight model fell to 0%, even with tool use enforced at the API level.
  • Root cause is grammar-based token masking in SGLang/vLLM: JSON FSM states set tool-call token logits to negative infinity, making them unreachable by design.
  • Transparent Two-Pass Execution, which runs tool calls before schema constraints activate, restored tool invocation from 0% to 100% without model retraining.

Silent failures are the worst kind in production, and a paper from researchers at Focus AI Center and Nanjing University of Science and Technology documents one that has likely been hiding in deployed agent systems for a while. When open-weight LLMs are configured with both tool-calling and JSON Schema structured output constraints active simultaneously, they stop invoking tools entirely. Schema compliance stays high. The agent looks healthy. It is not.

The mechanism is specific and traceable. Inference frameworks like SGLang and vLLM compile JSON Schema constraints into grammar-based finite-state machines that apply a vocabulary mask at every decoding step, setting non-conforming token logits to negative infinity. For models in the Qwen family, tool calls are formatted as XML-style tags opening with the `<` character (U+003C). Because that character is not a valid token in any JSON FSM state, it is masked out at every step of generation, so the model's internal preference for calling a tool is overridden before sampling even happens.
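The effect of such a mask can be illustrated with a toy example. This is a minimal sketch, not the SGLang or vLLM implementation: the three-token vocabulary, the logit values, and the `masked_softmax` helper are all made up for illustration. It only shows that a logit of negative infinity yields a probability of exactly zero, however strongly the model preferred that token.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over a toy vocabulary after a grammar mask is applied.

    Tokens outside `allowed` get a logit of -inf, so their probability is
    exactly 0. Assumes at least one token is allowed.
    """
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits: the model strongly prefers to open a tool call.
logits = {"<tool_call>": 9.0, "{": 2.0, "[": 0.5}

# Hypothetical mask for the first step of a JSON object: only "{" is legal.
allowed_at_start = {"{"}

unmasked = masked_softmax(logits, set(logits))
masked = masked_softmax(logits, allowed_at_start)

print(unmasked["<tool_call>"] > 0.99)  # True: the model wants the tool call
print(masked["<tool_call>"])           # 0.0: the mask makes it unreachable
```

Because the mask is applied to the logits rather than to the weights, nothing the model learns can raise the masked token's probability above zero.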

The researchers tested seven model instances spanning 20B to 397B parameters, across multiple model families and deployment settings, using both SGLang and vLLM. Under the tool-only baseline condition, all models achieved a tool invocation rate of 100%. Under joint constraints, that rate dropped to 0% across all open-weight models, regardless of schema complexity, explicit prompting, or even API-level `tool_choice="required"` enforcement. The evaluated closed-source reference model (GPT-5.4-mini) maintained stable tool execution under all conditions. Multiple fine-tuned variants, including SFT- and GRPO-trained versions, showed no improvement: the mask is applied after the model produces logits, so weight-level optimization has no path to override it.

The proposed fix is called Transparent Two-Pass Execution. A first inference pass runs with tools enabled and schema constraints off and collects all tool results; a second pass then runs with schema constraints active and no tool execution. In testing, this restored tool invocation from 0% to 100% while preserving schema compliance, with no model retraining required. The honest caveat is that it adds latency equal to one additional inference round plus tool execution time, which matters for latency-sensitive deployments. Code and data are available at the project repository.
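The two-pass pattern can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `two_pass`, `fake_generate`, the `get_weather` tool, and the response dictionary shape are all hypothetical, and a real deployment would replace the stub with calls to its own inference client. The sketch also assumes a single round of tool calls in the first pass.

```python
import json

def two_pass(prompt, generate, tools, schema):
    """Run tools in an unconstrained pass, then format under the schema.

    `generate` is a hypothetical callable (prompt, tools, schema) -> dict with
    "tool_calls" and "content" keys; it stands in for a real inference client.
    """
    # Pass 1: tools enabled, schema constraint off, so tool-call tokens stay reachable.
    first = generate(prompt=prompt, tools=sorted(tools), schema=None)
    results = [
        {"name": call["name"], "result": tools[call["name"]](**call["arguments"])}
        for call in first.get("tool_calls", [])
    ]

    # Pass 2: schema constraint on, tools off; tool results travel in the prompt.
    context = prompt + "\n\nTool results:\n" + json.dumps(results)
    second = generate(prompt=context, tools=None, schema=schema)
    return json.loads(second["content"])


def fake_generate(prompt, tools, schema):
    """Stub model for illustration: calls a tool if offered, else emits JSON."""
    if tools:
        call = {"name": "get_weather", "arguments": {"city": "Nanjing"}}
        return {"tool_calls": [call], "content": ""}
    summary = "sunny" if "sunny" in prompt else "unknown"
    return {"tool_calls": [], "content": json.dumps({"answer": summary})}


tools = {"get_weather": lambda city: f"sunny in {city}"}
schema = {"type": "object", "required": ["answer"]}
answer = two_pass("What is the weather in Nanjing?", fake_generate, tools, schema)
print(answer)  # {'answer': 'sunny'}
```

The second call to `generate` is where the extra inference round comes from, which is the latency cost described above.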

What the paper does not settle is how broadly other open-weight model families with different tool-call token formats are affected, why GPT-5.4-mini appears immune, or whether the two-pass pattern introduces its own failure modes at scale. Those are real gaps. But the core finding is the more important point for teams assembling agent pipelines today: evaluating tool use and structured output separately can mask a complete production failure when they run together.
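One way to catch this in a pipeline's own tests is to measure both properties on the same responses. The sketch below is an assumption-laden illustration, not the paper's evaluation harness: the `audit` helper, the response dictionary shape, and the sample data are invented, and "schema compliance" is reduced to checking for required top-level keys.

```python
import json

def audit(responses, required_keys):
    """Measure schema compliance and tool invocation on the same responses.

    Each response is a hypothetical dict with a "content" string and a
    "tool_calls" list. Compliance here only checks required top-level keys.
    """
    if not responses:
        raise ValueError("need at least one response")
    compliant = 0
    invoked = 0
    for response in responses:
        try:
            body = json.loads(response["content"])
        except (json.JSONDecodeError, TypeError):
            body = None
        if isinstance(body, dict) and set(required_keys).issubset(body):
            compliant += 1
        if response.get("tool_calls"):
            invoked += 1
    total = len(responses)
    return {"schema_compliance": compliant / total,
            "tool_invocation": invoked / total}

# Made-up responses mimicking the reported pattern: valid JSON, no tool calls.
joint_run = [
    {"content": '{"answer": "unknown"}', "tool_calls": []},
    {"content": '{"answer": "n/a"}', "tool_calls": []},
]
metrics = audit(joint_run, {"answer"})
silent_drop = metrics["schema_compliance"] > 0 and metrics["tool_invocation"] == 0

print(metrics)      # {'schema_compliance': 1.0, 'tool_invocation': 0.0}
print(silent_drop)  # True
```

A check like `silent_drop` only surfaces the problem when it runs against the joint configuration; run against the tool-only or schema-only configuration, each metric can look healthy on its own.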