reddit.com via Reddit May 18th 2026

Qwen 3.6 27B Q8 Runs on Four Nvidia RTX A4000 Cards


Key insights

Four Nvidia RTX A4000 16GB cards pool 64GB of VRAM to run Qwen 3.6 27B Q8 via llama.cpp, the first community benchmark of a prosumer multi-GPU A4000 workstation.
Multi-Token Prediction throughput gains documented on RTX 3090 hardware have not yet been confirmed on A4000 professional multi-GPU configurations.
Tensor-split configuration across four matched 16GB professional GPUs is the primary unresolved tuning variable in this benchmark setup.
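
The 64GB pool and the tensor-split question both come down to simple arithmetic. The sketch below estimates the weight footprint of a Q8_0 model and what an even four-way split leaves free on each card. It assumes the nominal 27 billion parameter count and llama.cpp's Q8_0 layout of 34 bytes per block of 32 weights (8.5 bits per weight), treats 1 GB as 1e9 bytes, and ignores KV cache, compute buffers, and any MTP-related tensors, none of which the post quantifies.

```python
# Rough VRAM budget for a Q8_0 model split evenly across matched GPUs.
# Assumptions (not from the benchmark post): nominal 27e9 parameters,
# Q8_0 stored as 32 int8 weights plus one fp16 scale per block (34 bytes).

Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 34  # 32 x int8 + 2-byte fp16 scale


def q8_0_weight_bytes(n_params: float) -> float:
    """Approximate bytes needed to store n_params weights in Q8_0."""
    return n_params / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES


def even_split_budget(n_params: float, n_gpus: int, vram_gb_per_gpu: float) -> dict:
    """Per-GPU weight share and leftover headroom for an even split (GB = 1e9 bytes)."""
    total_gb = q8_0_weight_bytes(n_params) / 1e9
    per_gpu_gb = total_gb / n_gpus
    return {
        "pooled_vram_gb": n_gpus * vram_gb_per_gpu,
        "weights_gb": total_gb,
        "weights_per_gpu_gb": per_gpu_gb,
        # Headroom must still cover KV cache and compute buffers.
        "headroom_per_gpu_gb": vram_gb_per_gpu - per_gpu_gb,
    }


if __name__ == "__main__":
    budget = even_split_budget(n_params=27e9, n_gpus=4, vram_gb_per_gpu=16)
    for name, value in budget.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the weights alone take roughly 28.7 GB, or about 7.2 GB per card, so the remaining headroom on each A4000 is what context length and batch settings have to fit into.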

Why this matters

Multi-GPU tensor-split performance in llama.cpp on professional workstation hardware is an underexplored inference tier between consumer gaming GPUs and datacenter accelerators, and this benchmark is the first structured data point in that bracket. Teams considering on-prem LLM deployment on reclaimed professional workstations now have a concrete 27B Q8 reference configuration to anchor procurement and configuration decisions, rather than extrapolating from dissimilar consumer hardware. Whether MTP yields meaningful throughput gains on multi-GPU setups with less VRAM per card than RTX 3090 configurations will determine whether prosumer professional-GPU clusters are viable for production local inference at this model scale.

Summary

Qwen 3.6 27B Q8 has now been benchmarked across four Nvidia RTX A4000 16GB cards on a Lenovo ThinkStation P3 running llama.cpp with Multi-Token Prediction (MTP) enabled, the first community data point for this prosumer professional-GPU setup. Prior community data covered RTX 3090, RTX 2060, Strix Halo, and dual RTX 5060 Ti configurations. The four A4000s pool 64GB of VRAM for the 27B Q8 model, and tensor-split tuning is the central variable distinguishing this configuration from setups with fewer, higher-VRAM consumer cards. Per the r/LocalLLaMA post, the test case is whether matched professional 16GB cards replicate the MTP throughput gains already documented on RTX 3090 hardware.

- MTP behavior on A4000 multi-GPU setups is unconfirmed relative to existing RTX 3090 baselines
- Whether four 16GB professional cards outperform fewer high-VRAM consumer GPUs remains the unresolved comparison

For teams repurposing workstation hardware for local inference, this is the 27B Q8 reference that previously did not exist.

Potential risks and opportunities

Risks

Developers who invest in four-A4000 workstation builds based on this single community benchmark may find MTP gains absent on professional 16GB cards, making the configuration cost-inefficient versus fewer high-VRAM consumer alternatives
Teams relying on current llama.cpp tensor-split defaults for A4000 quad-card setups risk performance regressions as MTP support matures and default scheduling changes in future builds
Single-benchmark procurement decisions for Qwen 3.6 27B Q8 on A4000 hardware carry configuration risk if tensor-split parameters are not tuned per workload, since the community data point represents one developer's setup rather than a validated deployment baseline
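
Because the risks above hinge on tensor-split values being tuned rather than left at defaults, it can help to generate the candidate configurations explicitly. The sketch below builds llama-server argument lists for a few split ratios using the `--split-mode` and `--tensor-split` options. The model path and the candidate ratios are hypothetical placeholders, and MTP-specific options are left out because the post does not name them.

```python
# Build llama-server argument lists for candidate tensor-split ratios.
# The model path and the ratios are hypothetical, not taken from the benchmark.
from typing import List, Sequence


def llama_server_args(
    model_path: str, split: Sequence[float], split_mode: str = "layer"
) -> List[str]:
    """Return an argv list that offloads all layers and divides them by `split`."""
    if split_mode not in {"none", "layer", "row"}:
        raise ValueError(f"unsupported split mode: {split_mode}")
    if not split or any(share < 0 for share in split) or sum(split) == 0:
        raise ValueError("split needs non-negative shares with a positive sum")
    ratio = ",".join(f"{share:g}" for share in split)
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "99",  # large value: offload every layer to GPU
        "--split-mode", split_mode,
        "--tensor-split", ratio,
    ]


# Hypothetical candidates for four matched cards.
CANDIDATE_SPLITS = [
    (1, 1, 1, 1),  # even split
    (3, 4, 4, 4),  # lighter share on GPU 0, in case it carries extra buffers
]

if __name__ == "__main__":
    for split in CANDIDATE_SPLITS:
        print(" ".join(llama_server_args("model.gguf", split)))
```

Each printed command can then be run and timed against the same prompts, so that a change in llama.cpp defaults shows up as a measured difference rather than a surprise.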

Opportunities

Enterprises with existing Lenovo ThinkStation P3, Dell Precision, or HP Z-series workstations can now benchmark 27B-class models against a documented configuration without additional hardware procurement
Hardware resellers and system integrators targeting on-prem AI can position refurbished RTX A4000 quad-card workstations as a cost-efficient 64GB VRAM inference tier with a community-validated llama.cpp baseline
Contributors to llama.cpp can use this A4000 data point to optimize tensor-split and MTP scheduling specifically for professional 16GB multi-GPU configurations, a segment previously absent from optimization test matrices

What we don't know yet

Actual tokens-per-second figures with MTP enabled versus disabled on the four-A4000 setup were not reported in the benchmark
Whether the ThinkStation P3 PCIe topology limits tensor-split performance relative to RTX 3090 single-card or dual-card configurations is unaddressed
Cost-per-token comparison between four RTX A4000 cards and alternatives such as dual RTX 3090 or single RTX 4090 was not included
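
Once tokens-per-second figures are available, the two missing comparisons reduce to a pair of formulas. The sketch below computes the MTP speedup ratio and an amortized cost per million tokens. The cost model (hardware price plus electricity over a chosen service life) and all parameter names are assumptions made for illustration, and no measured values are filled in because the post reports none.

```python
# Helpers for the comparisons the benchmark leaves open.
# The amortization model and parameter names are illustrative assumptions.


def mtp_speedup(tps_with_mtp: float, tps_without_mtp: float) -> float:
    """Throughput ratio of MTP-enabled to MTP-disabled runs (1.0 = no gain)."""
    if tps_without_mtp <= 0:
        raise ValueError("baseline throughput must be positive")
    return tps_with_mtp / tps_without_mtp


def cost_per_million_tokens(
    hardware_cost: float,
    power_watts: float,
    price_per_kwh: float,
    tokens_per_second: float,
    service_hours: float,
) -> float:
    """Hardware plus energy cost per 1e6 generated tokens over the service life."""
    if tokens_per_second <= 0 or service_hours <= 0:
        raise ValueError("throughput and service hours must be positive")
    total_tokens = tokens_per_second * 3600 * service_hours
    energy_cost = power_watts / 1000 * service_hours * price_per_kwh
    return (hardware_cost + energy_cost) / total_tokens * 1e6
```

Feeding the same `service_hours` and `price_per_kwh` into the function for a four-A4000 build and for a dual RTX 3090 or single RTX 4090 build would give a like-for-like figure, provided the throughput numbers come from the same model, quantization, and prompt set.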

Originally reported by reddit.com


Original headline: r/LocalLLaMA: Qwen 3.6 27B Q8 Benchmarked on Four Nvidia RTX A4000 16GB Cards With Llama.cpp and MTP Enabled — First Prosumer Multi-GPU A4000 Data Point