SigmaScale adds learned scaling to SVD-based LLM compression
TL;DR
- SigmaScale learns auxiliary diagonal scaling matrices to improve truncated SVD compression of large language model weights.
- The authors test the method on Llama 3.1 8B Instruct and Qwen3-8B across perplexity and zero-shot benchmarks.
- They argue learned scaling reduces an 'effective-rank entropy' that is strongly correlated with compression loss.
Compressing an 8B language model without breaking it is one of those problems where the headline techniques, quantization and pruning, get most of the attention while the supporting cast, truncated SVD on weight matrices, quietly does a lot of work. A new arXiv preprint, SigmaScale, is a contribution to that supporting cast.
The idea is small and focused. Truncated SVD compresses a weight matrix by keeping only its top singular components, and prior work has shown that pre-multiplying by a diagonal scaling matrix before the decomposition has a large effect on how much accuracy survives. The usual move is to derive that scaling matrix analytically. SigmaScale's pitch is to learn it instead, optimizing two sets of vectors that define diagonal row and column scaling transformations under an activation-aware loss.
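To make the mechanics concrete, here is a minimal NumPy sketch of truncated SVD with diagonal row and column scaling, scored by an activation-aware reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy matrix sizes, the synthetic calibration activations `X`, and the choice of per-channel activation RMS as a fixed column scale are all made up for this example. SigmaScale, as described, would instead learn the two scaling vectors by optimizing a loss of this general kind.

```python
import numpy as np


def scaled_truncated_svd(W, row_scale, col_scale, rank):
    """Return factors A, B with A @ B approximating W at the given rank.

    The SVD is taken of diag(row_scale) @ W @ diag(col_scale), and the
    scaling is divided back out of the factors afterwards. Both scale
    vectors must be nonzero. (Hypothetical helper for illustration.)
    """
    W_scaled = row_scale[:, None] * W * col_scale[None, :]
    U, s, Vt = np.linalg.svd(W_scaled, full_matrices=False)
    A = (U[:, :rank] * s[:rank]) / row_scale[:, None]  # (out, rank)
    B = Vt[:rank, :] / col_scale[None, :]              # (rank, in)
    return A, B


def activation_aware_loss(W, A, B, X):
    """Squared Frobenius error of the layer output on calibration inputs X."""
    return float(np.linalg.norm(W @ X - A @ (B @ X)) ** 2)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # toy weight matrix (out x in)
# Synthetic calibration activations with uneven per-channel magnitudes.
X = np.exp(rng.standard_normal((48, 1))) * rng.standard_normal((48, 256))

rank = 16
ones_out, ones_in = np.ones(64), np.ones(48)

# Baseline: plain truncated SVD (all scales equal to 1).
A_plain, B_plain = scaled_truncated_svd(W, ones_out, ones_in, rank)

# Analytic-style choice: scale input channels by their activation RMS.
act_rms = np.sqrt(np.mean(X ** 2, axis=1))
A_scaled, B_scaled = scaled_truncated_svd(W, ones_out, act_rms, rank)

loss_plain = activation_aware_loss(W, A_plain, B_plain, X)
loss_scaled = activation_aware_loss(W, A_scaled, B_scaled, X)
print(f"plain SVD loss:  {loss_plain:.2f}")
print(f"scaled SVD loss: {loss_scaled:.2f}")
```

The two factors store `rank * (64 + 48)` values instead of `64 * 48`, which is where the compression comes from. A learned variant would treat `row_scale` and `col_scale` as trainable parameters and minimize the loss with a gradient-based optimizer rather than fixing them up front.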
The authors evaluate on Llama 3.1 8B Instruct and Qwen3-8B, and report that the method is competitive with closely related state-of-the-art SVD-based compression approaches across perplexity and zero-shot benchmarks. They also argue that the learned scaling reduces what they call the effective-rank entropy of the weight matrices, and that this reduction is strongly correlated with the compression loss they measure.
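The summary here does not define effective-rank entropy, so the following is only a sketch of one common formulation: the Shannon entropy of the normalized singular values, whose exponential is often called the effective rank. Whether the paper uses exactly this definition is an assumption, and the function names are hypothetical.

```python
import numpy as np


def singular_value_entropy(W, eps=1e-12):
    """Shannon entropy of the singular values of W, normalized to sum to 1.

    Assumed definition for illustration; the paper's may differ.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


def effective_rank(W):
    """exp(entropy): roughly how many singular directions carry the weight."""
    return float(np.exp(singular_value_entropy(W)))


rng = np.random.default_rng(0)
flat_spectrum = np.eye(8)                                   # 8 equal singular values
rank_one = np.outer(rng.standard_normal(8), rng.standard_normal(8))

print(effective_rank(flat_spectrum))
print(effective_rank(rank_one))
```

Under this definition, a lower entropy means the spectrum is concentrated in fewer singular values, so a fixed-rank truncation discards less. That is the intuition behind linking an entropy reduction to lower compression loss.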
The honest caveat is that the public abstract does not put numbers on any of this. 'Competitive with state of the art' could mean a meaningful improvement or essentially a tie, and without the full result tables, and ideally released code, you cannot tell which. The activation-aware loss also implies a calibration step that the abstract does not characterize, and there is no statement about whether the scheme composes with the quantization that most production inference pipelines now stack on top.
What is worth watching is the direction. If learning the scaling behaves more reliably than deriving it, that is a small but reusable building block for anyone trying to fit Llama 3.1 8B or Qwen3-8B into memory-constrained settings, and the kind of trick that tends to show up next inside the open-source inference stacks.
Originally reported by arxiv.org
Original headline: SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices