reddit.com via Reddit

r/LocalLLaMA: llama.cpp b9455 Ships Multi-GPU SM Tensor KV Cache Fix — Critical Quantization Bug Affecting Multi-GPU Inference Now Resolved

open-source inference multi-gpu

Summary

llama.cpp release b9455 merges a fix for incorrect KV cache quantization on multi-GPU configurations that use the --sm tensor flag, a split-mode setting for distributing inference across multiple GPUs. Contributor JohannesGaessler implemented proper KV cache pipeline support for SM tensor mode, resolving silent cache corruption that produced incorrect outputs without raising any errors. Users running inference with a quantized KV cache across multiple GPUs are advised to update immediately; single-GPU users are unaffected.