reddit.com via Reddit

ik_llama Tops Qwen3.6-27B Speed Test on 24GB RTX 3090

Key insights

  • ik_llama hit 1,261 tok/s prefill and 72.9 tok/s decode on Qwen3.6-27B using IQ4_KS quantization on a single 24GB RTX 3090.
  • ik_llama enabled a 156K context window on 24GB VRAM by combining IQ4_KS quantization with q8_0/q8_0 KV cache and MTP.
  • ik_llama supports CPU-offloaded vision inference, adding multimodal capability on 24GB setups without additional GPU VRAM cost.

Why this matters

Consumer-grade 24GB GPUs are the practical hardware ceiling for most individual developers and small teams, so cross-backend benchmarks at this tier feed directly into real deployment decisions. ik_llama's lead over llama.cpp points to a fragmenting ecosystem in which the default local inference tooling is no longer the fastest option, forcing practitioners to re-evaluate stack choices that many treat as settled. Fitting 156K context, MTP, and 72.9 tok/s decode inside a single RTX 3090 moves self-hosted 27B models closer to viable production use.

Summary

Running Qwen3.6-27B on a single RTX 3090 now has real four-way backend numbers. An r/LocalLLaMA community benchmark tested ik_llama, llama.cpp, BeeLlama, and vllm at 24GB VRAM using a ~5.9K-token prompt plus 1K output on identical hardware. ik_llama won by a clear margin: with IQ4_KS quantization, a q8_0/q8_0 KV cache, a 156K context window, and MTP enabled, it reached 1,261 tok/s prefill and 72.9 tok/s decode. llama.cpp placed second, with BeeLlama and vllm behind. In short, all four backends are now in direct competition for the 24GB consumer-card inference tier.

  • ik_llama fits the full 27B model into 24GB with 156K context via IQ4_KS quant, outpacing the field on both prefill and decode speeds.
  • ik_llama's CPU-offloaded vision option adds multimodal capability without consuming extra VRAM, noted by commenters as a practical edge.

With four backends benchmarked on identical hardware, the RTX 3090 is solidifying as a legitimate local deployment target for 27B-scale models.
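
To put the reported throughput in wall-clock terms, the arithmetic can be sketched as below. This is a rough estimate, not part of the original benchmark: it assumes throughput stays constant over the whole request, reads "~5.9K" as 5,900 prompt tokens and "1K" as 1,000 output tokens, and uses a hypothetical helper name (estimate_latency) chosen for illustration.

```python
# Back-of-the-envelope latency from the reported ik_llama throughput figures.
PREFILL_TOK_PER_S = 1261.0  # reported prefill speed (tok/s)
DECODE_TOK_PER_S = 72.9     # reported decode speed (tok/s)


def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=PREFILL_TOK_PER_S,
                     decode_tps=DECODE_TOK_PER_S):
    """Return (prefill_s, decode_s, total_s), assuming constant throughput."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


if __name__ == "__main__":
    # Assumption: "~5.9K" is taken as 5,900 tokens and "1K" as 1,000 tokens.
    prefill_s, decode_s, total_s = estimate_latency(5900, 1000)
    print(f"prefill {prefill_s:.1f}s, decode {decode_s:.1f}s, total {total_s:.1f}s")
```

Under those assumptions the benchmark request works out to roughly 4.7 s of prefill and 13.7 s of decode, about 18.4 s end to end, so decode dominates the total even with a prompt almost six times longer than the output.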

Potential risks and opportunities

Risks

  • llama.cpp-dependent projects (Ollama, LM Studio) face user migration pressure if ik_llama's speed advantage replicates across additional models and GPU SKUs beyond the RTX 3090
  • Community benchmarks without strict variable controls (driver versions, thermal state, batch size, prompt structure) may lead practitioners to incorrect backend conclusions that hurt production latency
  • vllm's weak showing on consumer 24GB hardware could accelerate its perception as a datacenter-only stack, narrowing its adoption base precisely as local inference grows fastest

Opportunities

  • ik_llama's CPU-offloaded vision feature creates a differentiation path for multimodal local AI products targeting RTX 3090/4090 owners without requiring hardware upgrades
  • Backend-agnostic inference frontends (Ollama, LM Studio, Jan) could capture ik_llama's performance gains as a drop-in option, lowering the barrier for non-CLI users to switch
  • Alibaba's Qwen team can leverage this community benchmark as third-party validation that Qwen3.6-27B is optimized for consumer deployment, strengthening positioning against Meta's Llama lineup in the open-source developer segment

What we don't know yet

  • Whether llama.cpp's quantization config and MTP settings in this test were tuned to the same degree as ik_llama's, or whether configuration asymmetry explains part of the gap
  • Decode speeds for BeeLlama and vllm were not reported numerically in the summary, leaving their practical competitiveness against llama.cpp unquantified
  • Whether ik_llama's 156K context and MTP combination maintains output quality parity with llama.cpp at equivalent quant levels, or trades accuracy for throughput