arxiv.org web signal

Caprio, Corenflos, Power prove CAVI contracts in Wasserstein

TL;DR

  • The paper establishes Wasserstein contraction of coordinate ascent variational inference without assuming global strong log-concavity of the target.
  • The two conditions are a functional smoothness condition on the optimality maps and a transportation-information inequality at their fixed points.
  • Covered models include Ising and Curie-Weiss, Bayesian Gaussian mixtures, high-dimensional Bayesian probit regression, and Pólya-Gamma logistic regression.

Coordinate ascent variational inference (CAVI) is one of those workhorse algorithms that show up in almost every applied Bayesian pipeline, yet its convergence behaviour has never had particularly clean guarantees outside the friendly log-concave case. A new arXiv paper from Rocco Caprio, Adrien Corenflos and Sam Power sets out to fill some of that gap, at least in the Wasserstein-distance sense.

The technical claim is that CAVI contracts in Wasserstein distance under two conditions the authors identify: a functional smoothness condition on the optimality maps and a transportation-information inequality at their fixed points. Neither of those is going to fit on a slide, but the substantive point is what they replace. Earlier contraction results for CAVI typically leaned on global strong log-concavity, which excludes most of the interesting models people actually run. This paper's approach avoids that assumption and gives what the authors describe as local convergence on smooth, non-smooth, and discrete manifolds, including within the context of data augmentation.
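For readers who want the shape of the statement, here is a generic way to write it down. The first line is the textbook CAVI update for a product approximation q = q_1 ⊗ ... ⊗ q_d of a target π. The second is what a local contraction toward a fixed point q* usually looks like. The symbols q*, κ and the iteration index k are my notation, not the paper's. The abstract does not say which Wasserstein order is used, how large the neighbourhood of q* is, or what κ is.

```latex
% Textbook CAVI coordinate update (standard, not specific to this paper)
q_i^{\text{new}}(x_i) \;\propto\; \exp\!\Big( \mathbb{E}_{q_{-i}}\big[ \log \pi(x_i, x_{-i}) \big] \Big),
\qquad i = 1, \dots, d.

% Generic local contraction toward a fixed point q^* (assumed form; \kappa is hypothetical)
W\big(q^{(k+1)}, q^*\big) \;\le\; \kappa \, W\big(q^{(k)}, q^*\big),
\qquad 0 \le \kappa < 1, \quad q^{(k)} \text{ close to } q^*.
```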

That last clause is where the applied relevance shows up. The abstract lists the models the framework covers: Markov random field models like Ising and Curie-Weiss, Bayesian Gaussian mixtures, high-dimensional Bayesian probit regression, and logistic regression with Pólya-Gamma random variables. Those are exactly the models where CAVI is used routinely but where formal justification has been thin, and the authors note their results are novel for many of them.
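To make the Curie-Weiss case concrete, here is a minimal sketch of mean-field CAVI on that model. It runs the algorithm from two random starting points and tracks how far apart the iterates are. Everything specific here is my assumption rather than the paper's setup: the function names, the coupling and field values, and the choice of distance (W1 under Hamming cost, which for two product measures on {-1,+1}^n equals half the L1 distance between their mean vectors). The inverse temperature is deliberately small, so this is the easy high-temperature regime and says nothing about the paper's actual conditions.

```python
import numpy as np


def cavi_sweep(m, J, h, beta):
    """One systematic-scan CAVI sweep for a mean-field Ising model.

    m[i] is the mean of spin i under the current product distribution.
    J is assumed symmetric with a zero diagonal.
    """
    m = m.copy()
    for i in range(len(m)):
        # Optimal factor for site i given the other factors (textbook update).
        m[i] = np.tanh(beta * (J[i] @ m + h[i]))
    return m


def w1_product(m_a, m_b):
    """W1 under Hamming cost between two product measures on {-1,+1}^n."""
    return 0.5 * np.abs(m_a - m_b).sum()


def run_demo(n=20, beta=0.5, sweeps=6, seed=0):
    """Run CAVI from two random starts; return their distance after each sweep."""
    rng = np.random.default_rng(seed)
    J = (np.ones((n, n)) - np.eye(n)) / n  # Curie-Weiss couplings (illustrative)
    h = np.full(n, 0.1)                    # small external field (illustrative)
    m_a = rng.uniform(-0.9, 0.9, n)
    m_b = rng.uniform(-0.9, 0.9, n)
    dists = [w1_product(m_a, m_b)]
    for _ in range(sweeps):
        m_a = cavi_sweep(m_a, J, h, beta)
        m_b = cavi_sweep(m_b, J, h, beta)
        dists.append(w1_product(m_a, m_b))
    return dists


if __name__ == "__main__":
    for k, d in enumerate(run_demo()):
        print(f"sweep {k}: distance between the two runs = {d:.3e}")
```

With these settings the distance between the two runs shrinks at every sweep, which is the behaviour a contraction result formalises. Raising beta well above 1 moves the model into the low-temperature regime, where the mean-field objective has multiple fixed points and only local statements can be expected.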

The honest caveat is that this is a theory paper and I am reading it from the arXiv abstract rather than the proofs. What the abstract does not give you is the contraction rate, how it scales with dimension, or how the transportation-information inequality is actually checked on a discrete model like Ising in practice. There is also no signal about whether the paper carries empirical experiments. But the direction, formal guarantees for a heavily used algorithm on a much broader class of models, is the part practitioners doing probabilistic ML should be watching.
