reddit.com via Reddit June 1st 2026

r/LocalLLaMA: llama.cpp b9455 Ships Multi-GPU SM Tensor KV Cache Fix — Critical Quantization Bug Affecting Multi-GPU Inference Now Resolved

open-source inference, multi-gpu

Summary

llama.cpp release b9455 includes a fix for incorrect quantized KV cache handling on multi-GPU configurations that use the --sm tensor flag, a split mode for distributing inference across multiple GPUs. Contributor JohannesGaessler implemented proper KV cache pipeline support for this mode, resolving silent cache corruption that produced incorrect outputs without raising any error. Users who run a quantized KV cache across multiple GPUs with this setting are advised to update immediately. Single-GPU users are unaffected.
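The conditions in the summary can be sketched as a small check that tells you whether a given setup falls into the affected group. This is only an illustration: the function and parameter names are hypothetical and not part of llama.cpp. It assumes that every build before b9455 is affected (the summary does not say when the bug was introduced) and that f32, f16, and bf16 count as unquantized KV cache types.

```python
import re

FIXED_BUILD = 9455  # release named in the summary as containing the fix

# Assumption: these cache types are treated as unquantized; anything else
# (for example q8_0 or q4_0) is treated as a quantized KV cache.
UNQUANTIZED_KV_TYPES = {"f32", "f16", "bf16"}


def parse_build(tag: str) -> int:
    """Parse a release tag such as 'b9455' into its build number."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def is_affected(build_tag: str, gpu_count: int, split_mode: str,
                kv_cache_type: str) -> bool:
    """Return True if the setup matches all conditions from the summary.

    Hypothetical helper: assumes all builds before the fix are affected.
    """
    return (
        parse_build(build_tag) < FIXED_BUILD
        and gpu_count > 1
        and split_mode == "tensor"
        and kv_cache_type.lower() not in UNQUANTIZED_KV_TYPES
    )
```

For example, `is_affected("b9454", 2, "tensor", "q8_0")` is true, while the same setup on `"b9455"`, on a single GPU, or with an `f16` cache is not flagged.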

Originally reported by reddit.com

