r/LocalLLaMA: llama.cpp b9455 Ships Multi-GPU SM Tensor KV Cache Fix — Critical Quantization Bug Affecting Multi-GPU Inference Now Resolved
Summary
llama.cpp release b9455 merges a fix for incorrect KV cache quantization on multi-GPU configurations that use the tensor split mode (-sm tensor, long form --split-mode tensor), one of the split modes for distributing inference across multiple GPUs. Contributor JohannesGaessler implemented proper KV cache support for this split mode, resolving silent cache corruption that produced incorrect outputs without raising errors. Users running a quantized KV cache on multi-GPU setups with this split mode are advised to update immediately; single-GPU users are unaffected.
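To check whether an existing setup falls into the affected group, the conditions above can be sketched as a small script. This is an illustration only, not part of llama.cpp: the helper names (build_number, flag_value, may_be_affected) are hypothetical, and it assumes that build tags increase monotonically, that b9455 is the first build containing the fix (as the post states), and that flags are passed as separate "flag value" tokens rather than "flag=value".

```python
import re

FIXED_BUILD = 9455  # release tag named in the post; assumed to be the first fixed build

# Cache types treated as unquantized here; anything else counts as quantized.
UNQUANTIZED = {"f32", "f16", "bf16"}


def build_number(tag: str) -> int:
    """Extract the integer from a build tag such as 'b9455'."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def flag_value(args, names):
    """Return the value following the last occurrence of any flag in `names`, or None."""
    value = None
    for i, arg in enumerate(args[:-1]):
        if arg in names:
            value = args[i + 1]
    return value


def may_be_affected(tag, args):
    """True if the build predates the fix AND the command line combines
    tensor split mode with a quantized KV cache type."""
    if build_number(tag) >= FIXED_BUILD:
        return False
    split_mode = flag_value(args, {"-sm", "--split-mode"})
    cache_types = [
        flag_value(args, {"-ctk", "--cache-type-k"}),
        flag_value(args, {"-ctv", "--cache-type-v"}),
    ]
    quantized = any(t is not None and t not in UNQUANTIZED for t in cache_types)
    return split_mode == "tensor" and quantized


if __name__ == "__main__":
    # Hypothetical launch arguments for an older build.
    args = ["-sm", "tensor", "-ctk", "q8_0", "-ctv", "q8_0"]
    print(may_be_affected("b9454", args))
```

The check is deliberately conservative: it only flags the exact combination the post describes (tensor split mode plus a quantized cache type on a pre-b9455 build) and says nothing about other configurations.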
Originally reported by reddit.com