reddit.com via Reddit

DeepSeek-V4 Runs on $2,500 RTX 2080 Ti Cluster

deepseek inference open source local-inference quantization

Key insights

  • With W8A8 quantization, DeepSeek-V4 reached 255 prefill tokens/sec on four used RTX 2080 Ti cards costing under $2,500 in total.
  • Custom CUDA kernels targeting the 2018-era Turing architecture were required to reach viable inference throughput on legacy hardware.
  • The build challenges the assumption that frontier MoE models require H100 clusters or current-generation consumer GPUs.

Why this matters

Running a frontier MoE model locally for under $2,500 resets expectations for what self-hosted AI infrastructure costs, and it undercuts cloud inference pricing models that treat hardware scarcity as a baseline. For AI founders and practitioners weighing build-vs-buy decisions, the build shows that used-market GPU clusters with custom kernel work can reach prefill throughput that was previously gated behind expensive hardware procurement or cloud spend. Because the kernel patches and quantization pipeline were released publicly, the capability can spread quickly through the open-source community, shortening the time before cost-competitive self-hosted inference becomes a default option rather than an edge case.

Summary

A community developer has run DeepSeek-V4, a frontier-class mixture-of-experts model, on four used RTX 2080 Ti cards for under $2,500 total hardware cost, reaching 255 prefill tokens per second through custom Turing-architecture kernel optimizations and W8A8 quantization.

The technical lift here is substantial. The RTX 2080 Ti is a 2018-era Turing GPU with 11GB of VRAM per card and far less memory bandwidth and capacity than current-generation hardware. Getting a large MoE model to run efficiently required writing custom CUDA kernels targeting the Turing ISA and applying W8A8 quantization (8-bit weights, 8-bit activations) to fit the model across the four-card setup without catastrophic throughput degradation.

In short, an independent developer with no institutional backing demonstrated that H100 clusters are not a prerequisite for running DeepSeek-V4 at usable inference speeds.

  • 255 prefill tokens per second on a sub-$2,500 rig is within range of practical use for many developer and research workloads.
  • Custom Turing kernels and W8A8 quantization were both necessary to achieve this; neither alone was sufficient.
  • Full hardware config, kernel patches, and benchmark data were shared publicly in the Reddit thread.

The result lowers the cost floor for self-hosted frontier-class inference by at least an order of magnitude relative to what cloud pricing implies is necessary.
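
For readers unfamiliar with the term, the idea behind W8A8 can be sketched in a few lines of plain Python. This is a minimal illustration of per-tensor symmetric int8 quantization, one common way to implement 8-bit weights and 8-bit activations. It is not the thread author's kernel code, the function names are hypothetical, and the sample values are made up for the example.

```python
# Illustrative only: per-tensor symmetric int8 quantization, one common W8A8 variant.
# Not the thread author's kernels; function names and sample values are hypothetical.

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product computed on 8-bit weight codes and 8-bit activation codes."""
    w_codes, w_scale = quantize_int8(weights)
    a_codes, a_scale = quantize_int8(activations)
    # Integer multiply-accumulate; GPU kernels typically hold this sum in int32.
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    # A single floating-point rescale approximates the full-precision result.
    return acc * w_scale * a_scale


weights = [0.12, -0.53, 0.98, -0.07]
activations = [1.5, -2.0, 0.25, 3.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f} abs_err={abs(exact - approx):.4f}")
```

The gap between `exact` and `approx` is the rounding error that quantization introduces. Across billions of parameters that error can show up as quality loss, which is why throughput numbers alone do not settle whether a quantized deployment is usable.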

Potential risks and opportunities

Risks

  • If W8A8 quantization introduces task-specific quality regressions not captured in throughput benchmarks, developers deploying this stack in production could ship degraded outputs without realizing it.
  • Custom Turing kernels without upstream support create a maintenance burden that could leave this setup stranded when future DeepSeek-V4 updates or quantization tooling changes break compatibility.
  • Widespread replication of this setup on the used GPU market could spike RTX 2080 Ti prices, eroding the sub-$2,500 cost advantage within months as supply tightens.

Opportunities

  • Used GPU resellers and marketplaces (eBay, GPU server refurbishers) will likely see demand spikes for RTX 2080 Ti inventory as developers attempt to replicate this build.
  • Quantization tooling vendors and open-source projects (llama.cpp, vLLM, AutoAWQ) can integrate the Turing-specific kernel optimizations to lower the barrier for legacy-GPU inference without requiring custom kernel work.
  • Cloud inference providers face renewed pricing pressure as self-hosted frontier inference becomes demonstrably viable at consumer hardware costs, creating an opening for providers who offer transparent per-token pricing tied to actual hardware cost rather than scarcity premiums.

What we don't know yet

  • Output quality degradation from W8A8 quantization relative to full-precision DeepSeek-V4 was not benchmarked in the shared thread.
  • Whether the custom Turing kernels are stable enough for sustained multi-day inference workloads, or only validated for benchmark runs, remains unaddressed.
  • Decode throughput (tokens/sec during generation, as opposed to prefill) was not prominently reported, even though it matters more than prefill speed for interactive use cases.
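
To see why the missing decode figure matters, a rough latency model helps. In the sketch below, only the 255 tokens/sec prefill rate comes from the thread. The decode rates, prompt length, and output length are placeholders chosen for illustration, not measurements, and the model ignores batching and other overheads.

```python
# Back-of-the-envelope latency model. Only PREFILL_TOKENS_PER_SEC comes from the
# thread; decode rates and workload sizes below are placeholders, not measurements.

PREFILL_TOKENS_PER_SEC = 255.0  # prefill rate reported in the thread


def time_to_first_token(prompt_tokens, prefill_rate=PREFILL_TOKENS_PER_SEC):
    """Seconds spent processing the prompt before generation can start."""
    return prompt_tokens / prefill_rate


def total_response_time(prompt_tokens, output_tokens, decode_rate,
                        prefill_rate=PREFILL_TOKENS_PER_SEC):
    """Prefill time plus generation time for one request."""
    if decode_rate <= 0:
        raise ValueError("decode_rate must be positive")
    return time_to_first_token(prompt_tokens, prefill_rate) + output_tokens / decode_rate


# Hypothetical decode rates (tokens/sec); the thread did not report this figure.
HYPOTHETICAL_DECODE_RATES = (2.0, 10.0, 30.0)

for rate in HYPOTHETICAL_DECODE_RATES:
    total = total_response_time(prompt_tokens=2048, output_tokens=512, decode_rate=rate)
    print(f"decode {rate:>5.1f} tok/s -> {total:7.1f} s for a 2048-token prompt, 512-token reply")
```

Under these assumptions the prefill cost is fixed at about eight seconds for a 2,048-token prompt, while the total varies widely with the decode rate. That is the number a reader would need before judging the build suitable for interactive use.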