ik_llama Tops Qwen3.6-27B Speed Test on 24GB RTX 3090
Key insights
- ik_llama hit 1,261 tok/s prefill and 72.9 tok/s decode on Qwen3.6-27B using IQ4_KS quantization on a single 24GB RTX 3090.
- ik_llama ran a 156K context window within 24GB VRAM using IQ4_KS quantization, a q8_0/q8_0 KV cache, and MTP (multi-token prediction) enabled.
- ik_llama supports CPU-offloaded vision inference, adding multimodal capability on 24GB setups without additional GPU VRAM cost.
Why this matters
Consumer-grade 24GB GPUs are the practical hardware ceiling for most individual developers and small teams, so cross-backend benchmarks at this tier feed directly into real deployment decisions. ik_llama's lead over llama.cpp suggests the default local inference tooling is no longer the fastest option, which pushes practitioners to re-evaluate stack choices that many treat as settled. Fitting a 156K context, MTP, and 72.9 tok/s decode inside a single RTX 3090 moves self-hosted 27B models closer to viable production use on cost-performance grounds.
Summary
Running Qwen3.6-27B on a single RTX 3090 now has four-way backend numbers. An r/LocalLLaMA community benchmark tested ik_llama, llama.cpp, BeeLlama, and vllm at 24GB VRAM using a ~5.9K-token prompt plus 1K output under identical hardware conditions.
ik_llama won by a clear margin: IQ4_KS quantization, q8_0/q8_0 KV cache, 156K context window, and MTP enabled pushed it to 1,261 tok/s prefill and 72.9 tok/s decode. llama.cpp placed second, with BeeLlama and vllm behind.
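To put the reported throughput in wall-clock terms, the sketch below converts the two figures into an approximate request latency for the benchmark's workload. It assumes the "~5.9K-token prompt" is exactly 5,900 tokens and the "1K output" is exactly 1,000 tokens, and it ignores any startup or sampling overhead; the function name `estimate_latency` is purely illustrative.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_s, decode_s, total_s) for a single request.

    Simplified model: prefill and decode run back to back at a constant
    tokens-per-second rate, with no other overhead.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Assumed token counts (5,900 and 1,000); throughput figures are the
# ik_llama numbers reported in the benchmark.
prefill_s, decode_s, total_s = estimate_latency(
    prompt_tokens=5900,
    output_tokens=1000,
    prefill_tps=1261.0,
    decode_tps=72.9,
)

print(f"prefill: {prefill_s:.1f} s")
print(f"decode:  {decode_s:.1f} s")
print(f"total:   {total_s:.1f} s")
```

Under these assumptions the request takes roughly 4.7 s of prefill and 13.7 s of decode, about 18.4 s in total, so decode speed dominates the latency for this prompt-to-output ratio.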
In short, ik_llama, llama.cpp, BeeLlama, and vllm are now in direct competition for the 24GB consumer-card inference tier:
- ik_llama fits the full 27B model into 24GB with 156K context via IQ4_KS quant, outpacing the field on both prefill and decode speeds.
- ik_llama's CPU-offloaded vision option adds multimodal capability without consuming extra VRAM, noted by commenters as a practical edge.
With four backends benchmarked on identical hardware, the RTX 3090 is solidifying as a legitimate local deployment target for 27B-scale models.
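A back-of-envelope calculation shows why a 4-bit-class quantization leaves room for a long context on a 24GB card. The sketch below is a rough estimate only: the 4.25 bits-per-weight figure for IQ4_KS is an assumed nominal value that the benchmark summary does not state, real quantized files mix tensor types so the effective size differs, and the 24GB is treated as decimal gigabytes with no allowance for driver or compute buffers.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate weight size in decimal GB for a uniformly quantized model."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


VRAM_GB = 24.0            # RTX 3090 capacity, treated as decimal GB
PARAMS_BILLION = 27.0     # model size from the article
ASSUMED_BPW = 4.25        # assumption: nominal IQ4_KS bits per weight

weights_gb = weight_footprint_gb(PARAMS_BILLION, ASSUMED_BPW)
headroom_gb = VRAM_GB - weights_gb  # left for KV cache, activations, buffers

print(f"weights:  {weights_gb:.1f} GB")
print(f"headroom: {headroom_gb:.1f} GB")
```

With these assumed inputs the weights come to about 14.3 GB, leaving under 10 GB for the KV cache and everything else. That gives a sense of how tight the budget is for the 156K context, though the actual split in the benchmark is not reported.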
Potential risks and opportunities
Risks
- llama.cpp-dependent projects (Ollama, LM Studio) face user migration pressure if ik_llama's speed advantage replicates across additional models and GPU SKUs beyond the RTX 3090
- Community benchmarks without strict variable controls (driver versions, thermal state, batch size, prompt structure) may lead practitioners to incorrect backend conclusions that hurt production latency
- vllm's weak showing on consumer 24GB hardware could accelerate its perception as a datacenter-only stack, narrowing its adoption base precisely as local inference grows fastest
Opportunities
- ik_llama's CPU-offloaded vision feature creates a differentiation path for multimodal local AI products targeting RTX 3090/4090 owners without requiring hardware upgrades
- Backend-agnostic inference frontends (Ollama, LM Studio, Jan) could capture ik_llama's performance gains as a drop-in option, lowering the barrier for non-CLI users to switch
- Alibaba's Qwen team can leverage this community benchmark as third-party validation that Qwen3.6-27B is optimized for consumer deployment, strengthening positioning against Meta's Llama lineup in the open-source developer segment
What we don't know yet
- Whether llama.cpp's quantization config and MTP settings in this test were tuned to the same degree as ik_llama's, or whether configuration asymmetry explains part of the gap
- What decode speeds BeeLlama and vllm reached; the summary gives no numbers for either, so their practical competitiveness against llama.cpp is unquantified
- Whether ik_llama's 156K context and MTP combination maintains output quality parity with llama.cpp at equivalent quant levels, or trades accuracy for throughput
Originally reported by reddit.com
Original headline: r/LocalLLaMA: Qwen3.6-27B on RTX 3090 — First 4-Backend Shootout Across ik_llama, llama.cpp, BeeLlama, and vllm at 24GB VRAM With MTP