reddit.com via Reddit

r/LocalLLaMA: Developer Finds llama.cpp Default Pipeline Parallelism Wastes VRAM With No Throughput Gain — Compile Flag Recovers the Penalty at Zero Speed Cost

open source inference llama-cpp local-llm vram-optimization

Summary

A developer on r/LocalLLaMA reports that llama.cpp's default pipeline parallelism mode adds significant VRAM overhead on multi-GPU setups while delivering no measurable throughput improvement in their testing. According to the post, disabling pipeline parallelism with a build-time compile flag recovers the VRAM at no speed cost. The finding is relevant to multi-GPU llama.cpp deployments and has drawn community discussion because pipeline parallelism is enabled by default in the project.