reddit.com via Reddit May 18th 2026

ik_llama Tops Qwen3.6-27B Speed Test on 24GB RTX 3090

open source inference local-llm

Key insights

ik_llama hit 1,261 tok/s prefill and 72.9 tok/s decode on Qwen3.6-27B using IQ4_KS quantization on a single 24GB RTX 3090.
ik_llama ran a 156K context window on 24GB VRAM using IQ4_KS quantization, a q8_0/q8_0 KV cache, and MTP enabled.
ik_llama supports CPU-offloaded vision inference, adding multimodal capability on 24GB setups without additional GPU VRAM cost.
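
To see why the KV cache type matters at long context, here is a rough sizing sketch. It assumes a standard attention layout in which every counted layer stores one K and one V vector per token. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6-27B's actual configuration, and 156K is taken to mean 156,000 tokens. The q8_0 cost of 34 bytes per 32 values follows the ggml block layout (32 int8 values plus one fp16 scale).

```python
# Bytes per cached element for two KV cache types.
F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = 34 / 32  # ggml q8_0 block: 32 int8 values + one fp16 scale


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size in GiB for a standard attention layout."""
    # Factor of 2 covers the K and V tensors.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30


# HYPOTHETICAL architecture values, chosen only to illustrate the formula.
N_CTX = 156_000      # assumption: "156K" read as 156,000 tokens
N_LAYERS = 16        # hypothetical number of layers that keep a KV cache
N_KV_HEADS = 4       # hypothetical
HEAD_DIM = 256       # hypothetical

f16_gib = kv_cache_gib(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BYTES_PER_ELEM)
q8_gib = kv_cache_gib(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM, Q8_0_BYTES_PER_ELEM)
ratio = q8_gib / f16_gib

print(f"f16 KV cache:  {f16_gib:.2f} GiB")
print(f"q8_0 KV cache: {q8_gib:.2f} GiB")
print(f"q8_0 / f16:    {ratio:.3f}")
```

Whatever the real architecture is, under this layout a q8_0 cache takes about 53% of the f16 footprint (1.0625 / 2), which is the kind of saving that makes a six-figure context plausible next to the model weights on a 24GB card.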

Why this matters

Consumer-grade 24GB GPUs represent the practical hardware ceiling for most individual developers and small teams, so cross-backend benchmarks at this tier are directly actionable for real deployment decisions. ik_llama's lead over llama.cpp signals ecosystem fragmentation where the default local inference tooling is no longer the fastest option, forcing practitioners to re-evaluate stack choices that many treat as settled. The combination of 156K context, MTP, and 72.9 tok/s decode fitting inside a single RTX 3090 shifts the cost-performance calculus for self-hosted 27B models closer to viable production use.

Summary

Qwen3.6-27B on a single RTX 3090 now has four-way backend numbers. An r/LocalLLaMA community benchmark tested ik_llama, llama.cpp, BeeLlama, and vllm at 24GB VRAM on identical hardware, using a ~5.9K-token prompt plus 1K tokens of output. ik_llama won by a clear margin: with IQ4_KS quantization, a q8_0/q8_0 KV cache, a 156K context window, and MTP enabled, it reached 1,261 tok/s prefill and 72.9 tok/s decode. llama.cpp placed second, with BeeLlama and vllm behind.

The four backends are now in direct competition for the 24GB consumer-card inference tier.

- ik_llama fits the full 27B model into 24GB with 156K context via IQ4_KS quant, outpacing the field on both prefill and decode speeds.
- ik_llama's CPU-offloaded vision option adds multimodal capability without consuming extra VRAM, which commenters noted as a practical edge.

With four backends benchmarked on identical hardware, the RTX 3090 is solidifying as a legitimate local deployment target for 27B-scale models.
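
As a rough illustration of what those throughput figures mean for a single request, the sketch below converts them into wall-clock time for the benchmark's workload. It assumes "~5.9K" means 5,900 prompt tokens and "1K" means 1,000 output tokens, and it ignores model load time, sampling overhead, and any variation in decode speed over the run. The function name is illustrative, not part of any backend.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_s, decode_s, total_s) from token counts and tok/s rates."""
    prefill_s = prompt_tokens / prefill_tps  # time to process the prompt
    decode_s = output_tokens / decode_tps    # time to generate the output
    return prefill_s, decode_s, prefill_s + decode_s


# Reported ik_llama figures; token counts are rounded assumptions.
prefill_s, decode_s, total_s = estimate_latency(
    prompt_tokens=5_900,
    output_tokens=1_000,
    prefill_tps=1_261.0,
    decode_tps=72.9,
)

print(f"prefill: {prefill_s:.1f} s")
print(f"decode:  {decode_s:.1f} s")
print(f"total:   {total_s:.1f} s")
```

Under these assumptions, prefill takes roughly 4.7 s and decode roughly 13.7 s, so generation dominates end-to-end latency for this prompt and output size.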

Potential risks and opportunities

Risks

llama.cpp-dependent projects (Ollama, LM Studio) face user migration pressure if ik_llama's speed advantage replicates across additional models and GPU SKUs beyond the RTX 3090
Community benchmarks without strict variable controls (driver versions, thermal state, batch size, prompt structure) may lead practitioners to incorrect backend conclusions that hurt production latency
vllm's weak showing on consumer 24GB hardware could accelerate its perception as a datacenter-only stack, narrowing its adoption base precisely as local inference grows fastest

Opportunities

ik_llama's CPU-offloaded vision feature creates a differentiation path for multimodal local AI products targeting RTX 3090/4090 owners without requiring hardware upgrades
Backend-agnostic inference frontends (Ollama, LM Studio, Jan) could capture ik_llama's performance gains as a drop-in option, lowering the barrier for non-CLI users to switch
Alibaba's Qwen team can leverage this community benchmark as third-party validation that Qwen3.6-27B is optimized for consumer deployment, strengthening positioning against Meta's Llama lineup in the open-source developer segment

What we don't know yet

Whether llama.cpp's quantization config and MTP settings in this test were tuned to the same degree as ik_llama's, or whether configuration asymmetry explains part of the gap
How BeeLlama and vllm compare numerically: their decode speeds were not reported in the summary, leaving their practical competitiveness against llama.cpp unquantified
Whether ik_llama's 156K context and MTP combination maintains output quality parity with llama.cpp at equivalent quant levels, or trades accuracy for throughput

Originally reported by reddit.com


Original headline: r/LocalLLaMA: Qwen3.6-27B on RTX 3090 — First 4-Backend Shootout Across ik_llama, llama.cpp, BeeLlama, and vllm at 24GB VRAM With MTP