lemongravy.me via Reddit

Intel Arc Pro B70 Delivers 63 t/s on Qwen 3.6-35B

Key insights

  • Intel Arc Pro B70 sustains 62-64 tokens/sec on Qwen 3.6-35B via SYCL, with prompt processing reaching 330-530 tokens/sec.
  • Reproducing the results requires a custom llama.cpp build, a seq_rm patch (PR #22534) for hybrid-model cache bugs, and Intel-specific driver flags.
  • A one-time ~27-second JIT compile on first container startup is the main latency cost; multi-turn context reuse then drops latency to milliseconds.
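To put these figures in perspective, here is a minimal back-of-the-envelope sketch of request latency. The throughput and JIT numbers are the ones reported above; the workload size (2,000 prompt tokens, 500 output tokens) and the function name are hypothetical, and the model assumes, as a simplification, that the first request waits for the full JIT compile and that throughput stays constant.

```python
def request_latency_s(prompt_tokens, output_tokens, pp_tps, gen_tps, jit_s=0.0):
    """Estimate wall-clock seconds for one request.

    Simplified model: optional one-time JIT cost, plus prompt processing
    time, plus generation time, each at a constant tokens/sec rate.
    """
    return jit_s + prompt_tokens / pp_tps + output_tokens / gen_tps


# Figures reported in the guide.
PP_TPS_LOW, PP_TPS_HIGH = 330.0, 530.0   # prompt processing, tokens/sec
GEN_TPS_LOW, GEN_TPS_HIGH = 62.0, 64.0   # sustained generation, tokens/sec
JIT_S = 27.0                             # one-time compile on first startup

# Hypothetical workload: 2,000 prompt tokens in, 500 tokens out.
cold = request_latency_s(2000, 500, PP_TPS_LOW, GEN_TPS_LOW, JIT_S)
warm = request_latency_s(2000, 500, PP_TPS_HIGH, GEN_TPS_HIGH)

print(f"cold start, low-end rates: {cold:.1f} s")
print(f"warm, high-end rates:      {warm:.1f} s")
```

Under these assumptions the JIT compile dominates only the very first request; afterwards generation time is the larger term for this workload shape.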

Why this matters

Intel offering a $949, 32GB VRAM GPU that sustains 62-64 tokens/sec on a 35-billion-parameter model changes the economic calculus for teams that want local inference without Nvidia pricing. The engineering overhead documented by Kyran Menezes (custom builds, upstream patches, driver environment variables) suggests that the software ecosystem gap, not silicon performance, is now the primary obstacle to Intel GPU adoption for LLM inference. For AI practitioners evaluating non-Nvidia hardware for on-premise deployments, the guide maps where Intel falls short of CUDA and which workarounds currently exist.

Summary

Kyran Menezes published a guide showing Intel's Arc Pro B70 ($949, 32GB VRAM) hitting 63+ t/s on Qwen 3.6-35B on an i5-12400F with DDR4 memory. Reaching those numbers required a custom llama.cpp build, a seq_rm patch (PR #22534) for a hybrid-model cache bug, Level Zero driver tuning, and CPU core pinning. SYCL, unlike Vulkan, routes through Intel's Level Zero driver and unlocks Xe Matrix Extensions. In short, Menezes shows that a $949 Intel Arc GPU can run a 35B model at usable speeds:

  • Prompt processing: 330-530 t/s; sustained generation: 62-64 t/s.
  • First startup carries a ~27-second JIT compile.
  • Verdict: Intel Arc is maturing fast but is not yet at parity with CUDA on stability or ecosystem breadth. The gap isn't raw performance; it's the engineering overhead to unlock it.
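The driver tuning and core pinning mentioned in the summary could look roughly like the sketch below. This is not the guide's actual configuration: `ONEAPI_DEVICE_SELECTOR` and `ZES_ENABLE_SYSMAN` are real oneAPI / Level Zero environment variables, but whether the guide uses them, and with which values, is an assumption. The model path, the core list, and the `build_launch` helper are hypothetical. The code only assembles the command and environment; it does not start a server.

```python
import os


def build_launch(model_path, device="level_zero:0", cores=(0, 1, 2, 3, 4, 5)):
    """Assemble a pinned llama-server command and its environment.

    All values are illustrative assumptions, not the guide's settings.
    """
    env = dict(os.environ)
    # Select the first Level Zero device for the SYCL runtime (assumed value).
    env["ONEAPI_DEVICE_SELECTOR"] = device
    # Let the runtime query device memory through Level Zero Sysman.
    env["ZES_ENABLE_SYSMAN"] = "1"

    core_list = ",".join(str(c) for c in cores)
    cmd = [
        "taskset", "-c", core_list,   # pin the process to specific CPU cores
        "llama-server",
        "-m", model_path,             # hypothetical GGUF path
        "-ngl", "99",                 # offload all layers to the GPU
    ]
    return cmd, env


cmd, env = build_launch("/models/model.gguf")
print(" ".join(cmd))
```

To launch, the pair could be passed to `subprocess.run(cmd, env=env)` on a Linux host with a SYCL-enabled llama.cpp build on the `PATH`.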

Potential risks and opportunities

Risks

  • Teams deploying this configuration in production face regression risk: the guide explicitly relies on unvetted commits and patches not yet in stable llama.cpp releases, meaning upstream merges could silently break performance or cache behavior.
  • Intel's oneAPI and Level Zero driver toolchain instability could widen the CUDA software gap rather than close it, especially if driver updates break the environment variable optimizations the guide depends on.
  • If Nvidia improves CUDA ecosystem tooling for the sub-$1000 GPU segment, Intel's current performance-per-dollar advantage in the 32GB VRAM space could shrink before Intel stabilizes its software stack.

Opportunities

  • Intel's oneAPI team has a public, documented gap list from this guide; merging the seq_rm patch (PR #22534) and stabilizing SYCL compiler flags are concrete steps that could convert early-adopter momentum into broader developer adoption.
  • Enterprise and government customers requiring 32GB VRAM local inference without Nvidia pricing now have a documented $949 reference architecture; system integrators offering Intel Arc-based inference servers gain a credibility anchor.
  • Containerized LLM deployment platforms and llama.cpp Docker maintainers could ship Intel SYCL-optimized images that bake in these optimizations, removing the primary barrier for non-expert Intel Arc users.

What we don't know yet

  • Whether the seq_rm patch (PR #22534) and other unvetted SYCL optimizations will land in a stable llama.cpp release, and on what timeline.
  • How the Arc Pro B70's 62-64 t/s on Qwen 3.6-35B compares numerically to Nvidia hardware at the same $949 price point; the article acknowledges a CUDA gap but provides no direct cross-hardware benchmarks.
  • Whether the SYCL-based setup is reproducible for non-experts, given the article's note that some users see better or more consistent performance with Vulkan on certain configurations.