github.com via Hacker News July 3rd 2026

James O'Beirne publishes local-LLM build guide for July 2026

open source deepseek inference ai-business

TL;DR

James O'Beirne's local-llm repository, dated July 2026, publishes a two-tier build: Qwen3.6-27B on a ~$2k budget and GLM-5.2 on a ~$40k rig.
The $40k tier centers on four NVIDIA RTX PRO 6000 Blackwell Workstation GPUs with 384GB VRAM total, quoted at roughly $46,000.
The serving stack is vLLM in Docker with opencode, quoting ~80 tokens per second at 240k context on the GLM configuration.

James O'Beirne's local-llm repository, dated July 2026, reads less like a benchmark rant and more like a bill of materials. Two price points, two recommended checkpoints. On roughly a $2k budget, he points at Qwen3.6-27B. On roughly a $40k budget, he points at GLM-5.2-Int8Mix-NVFP4-REAP-594B, dated 2026-07 in his own table. A note at the top says nothing in this README aside from the tables was written by AI, which sets the tone.

The high-end rig is where the numbers get concrete. The base system he lists totals $5,587: an ASRock Rack ROMED8-2T motherboard, an AMD EPYC Milan 7313P at 16 cores and 3.0 GHz, 128GB of DDR4 ECC across eight sticks, dual 1700W Super Flower PSUs, a 4TB M.2 boot drive plus two 8TB M.2 drives for weights, and an open-frame case. On top of that sit four NVIDIA RTX PRO 6000 Blackwell Workstation cards at 96GB each, 384GB of VRAM in total, quoted at roughly $46,000. A Microchip Switchtec PM40100 PCIe switch from c-payne knits the GPUs together at Gen4 line rate.

The serving stack is vLLM in Docker, with opencode as the inference interface, a Telegram bot for chat, a private Gitea instance, and search wired through searXNG and the Kagi API. On the GLM configuration, the repo quotes roughly 80 tokens per second at 240k context. There is also a whisper-large-v3 speech-to-text model that fits in about 11GB of VRAM alongside everything else.

The honest caveat is that this is one person's setup, and the throughput figure is quoted without a benchmark harness or a multi-user load test in what I retrieved. Model picks dated 2026-07 are a snapshot, not a plan, and a Blackwell workstation build assumes supply and pricing that can move quickly. Power, cooling, and noise for a quad-GPU open-frame rig are not the questions the README seems set up to answer.

What is useful, if you are the practitioner who has been waiting for someone to write down the exact motherboard and PCIe switch, is that it is now written down. The $2k Qwen path is the more broadly reachable one; the $40k GLM rig is a reference point, not a floor.

Originally reported by github.com

Read the original article →

Original headline: Jamesob Publishes 'Guide to Running SOTA LLMs Locally' - Documents 2026 Local-Model Stack Across Qwen 3.6 27B, GLM-5.2, and DeepSeek V4-Pro on Consumer Hardware