Kog ships Laneformer 2B, reporting 3,000 tokens/sec on 8× MI300X
TL;DR
- Kog released Laneformer 2B, a 2.3B-parameter instruction-tuned coding model built around decoding speed rather than benchmark score.
- The team reports 3,000 output tokens/s on 8× AMD MI300X and 2,100 on 8× NVIDIA H200 at FP16, batch size 1.
- Laneformer 2B scores 45.1% on HumanEval+ and 51.6% on MBPP+ under greedy decoding; 10 of its 15 layers use sliding-window attention.
Most small coding-model releases are judged on benchmarks alone. What stands out about Kog's Laneformer 2B is that the company designed the architecture around decoding speed instead of treating speed as a downstream optimization problem.
The team writes on Hugging Face that the 2.3B-parameter instruction-tuned model reaches 3,000 output tokens per second per request on 8× AMD MI300X, and 2,100 output tokens per second per request on 8× NVIDIA H200, using FP16, batch size 1, and no speculative decoding. They describe their approach as a "lane-structured Transformer architecture for high-speed single-request decoding," built around what they call Delayed Tensor Parallelism, a mechanism meant to hide the inter-GPU communication costs that would otherwise be paid at every layer. The rest of the architecture is conventional: 15 layers, 10 of them using sliding-window attention, and grouped-query attention with 32 query heads and 16 key/value heads sharded evenly across 8 lanes.
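To make those figures concrete, here is a minimal Python sketch of the arithmetic they imply: the per-token time budget behind each reported decode rate, and how many attention heads each lane holds under an even split. The `LaneLayout` class and `ms_per_token` helper are illustrative names, not part of Kog's code, and the sketch assumes one lane per GPU only in the sense that the post reports 8 lanes and 8-GPU systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneLayout:
    """Hypothetical container for the head counts quoted in the post."""
    query_heads: int
    kv_heads: int
    lanes: int

    def per_lane(self) -> tuple[int, int]:
        """Return (query heads, KV heads) held by each lane under an even split."""
        if self.query_heads % self.lanes or self.kv_heads % self.lanes:
            raise ValueError("heads do not divide evenly across lanes")
        return self.query_heads // self.lanes, self.kv_heads // self.lanes

    def queries_per_kv(self) -> int:
        """Number of query heads sharing each KV head (the GQA group size)."""
        return self.query_heads // self.kv_heads


def ms_per_token(tokens_per_second: float) -> float:
    """Per-token time budget, in milliseconds, implied by a decode rate."""
    return 1000.0 / tokens_per_second


layout = LaneLayout(query_heads=32, kv_heads=16, lanes=8)
q_per_lane, kv_per_lane = layout.per_lane()
print(q_per_lane, kv_per_lane, layout.queries_per_kv())  # 4 2 2
print(round(ms_per_token(3000), 3))  # 0.333 (8x MI300X figure)
print(round(ms_per_token(2100), 3))  # 0.476 (8x H200 figure)
```

At 3,000 tokens per second the whole 15-layer forward pass, communication included, has to fit in roughly a third of a millisecond per token, which is why per-layer synchronization cost is the thing the design targets.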
For coding quality, Kog reports 45.1% on HumanEval+ and 51.6% on MBPP+ under greedy decoding. Pre-training ran on 24 nodes of 8 H100s each, for 192 H100 GPUs in total. The post lists Morgan Giraud, Gauthier Tallec, and Gaël Delalleau as authors.
For practitioners building interactive coding agents, the number that matters is single-request decoding speed at batch size 1. That is the regime where most interactive assistants actually run, and where throughput-oriented benchmarks say little about the latency users feel. If those numbers hold up under independent testing, MI300X starts to look like a real inference option rather than a curiosity.
The honest caveat is in the post itself. The headline throughput comes from Kog's own public Kog Inference Engine (KIE) preview, not a neutral runtime, so you cannot just drop the checkpoint into vLLM and expect the same numbers. What the reporting does not give you is any third-party reproduction of the throughput claim, a head-to-head against other 2B coders on the same hardware, or any indication of how the lane structure behaves once batch sizes go up or speculative decoding is turned on. Take the specifics as reported, not settled.
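Anyone attempting an independent check would need a consistent way to time a single streamed request. The following is a minimal sketch of such a harness in Python, not Kog's methodology: `measure_decode_rate` and `fake_stream` are hypothetical names, the stream is a stand-in for whatever client a given runtime exposes, and the choice to start the decode clock at the first token (excluding prefill) is an assumption, since the reporting does not say how the headline figures were measured.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_decode_rate(
    stream: Callable[[], Iterable[object]],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time one streamed request at batch size 1.

    Reports time to first token separately from the decode rate, which is
    computed over the intervals between the first and last token.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in stream():
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if count < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    decode_seconds = last - first
    if decode_seconds <= 0:
        raise ValueError("clock resolution too coarse for this stream")
    return {
        "tokens": count,
        "time_to_first_token_s": first - start,
        # count - 1 intervals separate the first token from the last one.
        "decode_tokens_per_s": (count - 1) / decode_seconds,
    }


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a real streaming client."""
    for i in range(50):
        time.sleep(0.001)  # simulated per-token delay
        yield f"tok{i}"


print(measure_decode_rate(fake_stream))
```

Separating time to first token from the steady-state rate matters when comparing runtimes, because a per-request tokens-per-second figure can look very different depending on whether prompt processing is counted.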
If even half of that decoding speed survives outside Kog's own stack, the takeaway is that pairing small specialised models with heavy inference-engine co-design is becoming a competitive path for the parts of the agent stack where every millisecond shows up in the UI.
Originally reported by huggingface.co
Original headline: Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine