SK

The Gibbs distribution and general Bayes

Date written Jun 11, 2020
Filed under ML & Stats in math

Contents

The Gibbs distribution keeps coming up everywhere in machine learning. In this post, I'm going to focus on a general optimization problem and realize its various connections. Towards the end, I'm going to touch upon general Bayes which goes beyond classic posterior inference.

Optimization problem

The optimization problem we are interested in is formulated as below -

argminq(θ)P {J(q(θ);,D,p(θ))=Eq(θ)[(D;θ)]+KL[q(θ)  p(θ)]}\underset{q(\theta) \in \mathcal{P}}{\text{argmin}}{}~ \left\{ \mathcal{J}(q(\theta); \ell, \mathcal{D}, p(\theta)) = \mathbb{E}_{q(\theta)}\left[ \ell(\mathcal{D}; \theta) \right] + \mathcal{KL}\left[ q(\theta){}~ \big|\big|{}~ p(\theta) \right] \right\}

We aim to find a distribution q(θ)q(\theta) that minimizes the function J\mathcal{J} over a family of distributions P\mathcal{P} regularized by the prior p(θ)p(\theta) via the KL\mathcal{KL}-divergence. \ell is a loss function to our liking.

Optimal Gibbs distribution

We show two proofs for the fact the distribution qq that minimizes the optimization problem above is a specific form of a Gibbs distribution.

Variational Calculus

The variational calculus approach to solving would be to simply solve a constrained optimization problem. The only constraint we have is that q(θ)q(\theta) is a probability distribution.

argminq(θ)P J(q(θ);,D,p(θ))s.t. q(θ)dθ=1\begin{aligned} \underset{q(\theta) \in \mathcal{P}}{\text{argmin}}&{}~ \mathcal{J}(q(\theta); \ell, \mathcal{D}, p(\theta)) \\ \text{s.t.}&{}~ \int q(\theta) d\theta = 1 \end{aligned}

Converting this to an unconstrained optimization problem via the method of Lagrange multipliers [2], for a multiplier λR\lambda \in \mathbb{R}, we have

J(q(θ);,D,p(θ))+λ(q(θ)dθ1)\mathcal{J}(q(\theta); \ell, \mathcal{D}, p(\theta)) + \lambda \left(\int q(\theta) d\theta - 1 \right)

We take the functional derivatives a w.r.t q(θ)q(\theta) and the fact that this should be equal to zero for a minimum.

(D,θ)+1+logq(θ)logp(θ)+λ=0\ell(\mathcal{D}, \theta) + 1 + \log{q(\theta)} - \log{p(\theta)} + \lambda = 0q(θ)=p(θ)exp{1λ(D;θ)}q(\theta) = p(\theta) \exp{\left\{-1 - \lambda - \ell(\mathcal{D}; \theta)\right\}}

To eliminate λ\lambda, we put it back into the constraint and get

exp{1+λ}=p(θ)exp{(D;θ)}dθ\exp{\left\{1 + \lambda\right\}} = \int p(\theta) \exp{\left\{- \ell(\mathcal{D}; \theta)\right\}} d\theta

We get the complete form for our optimal Gibbs distribution

q(θ)=p(θ)exp{(D;θ)}p(θ)exp{(D;θ)}dθq(\theta) = \frac{p(\theta) \exp{\left\{- \ell(\mathcal{D}; \theta)\right\}}}{\int p(\theta^\prime) \exp{\left\{- \ell(\mathcal{D}; \theta^\prime)\right\}} d\theta^\prime}

KL-divergence minimizer

Alternatively, we can also make some manipulations to our original object of interest as follows

Eq(θ)[(D;θ)]+KL[q(θ)  p(θ)]Eq(θ)[logexp{(D;θ)}+logq(θ)p(θ)]Eq(θ)[logq(θ)p(θ)exp{(D;θ)}/Z]\begin{aligned} \mathbb{E}_{q(\theta)}\left[ \ell(\mathcal{D}; \theta) \right] + \mathcal{KL}\left[ q(\theta){}~ \big|\big|{}~ p(\theta) \right] \\ \mathbb{E}_{q(\theta)}\left[ \log{\exp{\left\{ \ell(\mathcal{D}; \theta) \right\}}} + \log{\frac{q(\theta)}{p(\theta)}} \right] \\ \mathbb{E}_{q(\theta)}\left[ \log{\frac{q(\theta)}{p(\theta)\exp{\left\{ -\ell(\mathcal{D}; \theta) \right\}}/Z}} \right] \end{aligned}

Note that we have sneaked in a normalizer ZZ as it does not change the optimization problem. This ensure the denominator remains a probability distribution and allows us to arrive at the following expression,

KL(q(θ)  p(θ)exp{(D;θ)}/Z)\mathcal{KL}\left( q(\theta) ~\big|\big|~ p(\theta)\exp{\left\{ -\ell(\mathcal{D}; \theta) \right\}} / Z \right)

Things are much simpler now because we know that KL\mathcal{KL}-divergence is non-negative and zero only when the two arguments are equal. Hence, the minimum (zero) is attained when

q(θ)=p(θ)exp{(D;θ)}Zq(\theta) = \frac{p(\theta)\exp{\left\{ -\ell(\mathcal{D}; \theta) \right\}}}{Z}

The normalizer can be simply arrived at by integrating out the numerator over θ\theta and we arrive at the optimal Gibbs distribution q(θ)q(\theta).

Connections

Bayesian inference can be seen as an infinite dimensional generalization of the optimization problem described above [3, 5, 4]. This result can also be extended to build all sorts of new schemes based on generalized divergences like the β\beta-divergence [6]. Generalized variational inference [7] encompasses a large family of inference schemes, for instance power likelihoods when we use a tempered divergence 1β\frac{1}{\beta} KL\mathcal{KL}.

Below we discuss a few connections to show the broad coverage of this formulation of the optimization problem.

Bayesian posterior

As an example, the classic Bayesian posterior inference can be recovered by assuming (D;θ)=logp(D  θ)\ell(\mathcal{D}; \theta) = -\log{p(\mathcal{D} ~|~ \theta)}. Plugging this back into the optimal Gibbs distribution we get,

q(θ)=p(θ)p(D  θ)p(θ)p(D  θ)dθq(\theta) = \frac{p(\theta) p(\mathcal{D} ~|~ \theta)}{\int p(\theta^\prime)p(\mathcal{D} ~|~ \theta) d\theta^\prime}

which is reminiscent of the classic posterior recovered by the Bayes theorem, q(θ)=p(θ  D)q(\theta) = p(\theta ~|~ \mathcal{D}).

Further, instead of optimizing over the universal family of distributions P\mathcal{P}, using a restricted family of distributions QP\mathcal{Q} \subset \mathcal{P}, we recover variational inference. For instance, when Q\mathcal{Q} is the family of diagonal covariance Gaussians, we recover Mean-Field variational inference (MFVI).

Entropy-Regularized RL

The Gibbs distribution also comes up in the formulation of Soft Q-Learning. We define an entropy-augmented return [8] as t=0γt(rtτKLt)\sum_{t=0}^{\infty} \gamma^t (r_t - \tau \mathcal{KL}_t) for a discount factor γ[0,1]\gamma \in [0, 1], instantaneuous reward rtr_t, a scalar coefficient τ\tau effectively balancing the explore/exploit dichotomy and the instantaneous distance KLt=KL(π(˙st)  π0(˙st))\mathcal{KL}_t = \mathcal{KL}(\pi(\dot | s_t) ~||~ \pi_0(\dot | s_t) ) between our policy π\pi and a reference policy π0\pi_0 at state sts_t. Without getting into the details, this definition leads us to the state value function at a state ss under policy π\pi as

Vπ(s)=Eaπ[Qπ(s,a)τKL(π  π0)(s)]V_\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q_\pi(s, a) - \tau \mathcal{KL}(\pi ~||~ \pi_0)(s) \right]

Qπ(s,a)Q_\pi(s, a) is the action value function under the policy π\pi at state ss under action aa. A natural optimal policy to define would be the one that maximizes this state value function greedily. Hence, putting this in the form of our minimization problem earlier, we want to minimize the objective over a family of policies Π\Pi given a reference policy π0\pi_0.

π(  s)=argminπΠ Eaπ[1τQπ(s,a)+KL(π  π0)(s)]\pi(\cdot ~|~ s) = \underset{\pi \in \Pi}{\text{argmin}}~ \mathbb{E}_{a \sim \pi}\left[ -\frac{1}{\tau} Q_\pi(s, a) + \mathcal{KL}(\pi ~||~ \pi_0)(s) \right]

By symmetry, =1τQπ(s,a)\ell = -\frac{1}{\tau} Q_\pi(s, a) and the optimal distribution comes out to be Gibbs distribution of the form

π(a  s)π0(a  s)exp1τQπ(s,a)\pi(a ~|~ s) \propto \pi_0(a ~|~ s)\exp{\frac{1}{\tau} Q_\pi(s, a)}

This formulation forms the foundation of Soft Q-Learning and its derivatives. Intuitively, this provides us a straightforward path to build an optimal policy once we've arrived at the correct action value function. In the limit τ\tau \to \infty, we recover our reference policy π0\pi_0 which basically amounts exploration under the policy π0\pi_0 and never exploiting the policy π\pi.

General Bayes

We make a small final note. The traditional paradigm of Bayesian inference forces us to define a likelihood model and a prior to arrive at the posterior as

p(θ  D)p(θ)priorp(D  θ)model likelihoodp(\theta ~|~ \mathcal{D}) \propto \overbrace{p(\theta)}^{\text{prior}} \underbrace{p(\mathcal{D} ~|~ \theta)}_{\text{model likelihood}}

If one believes in the Cox's Axioms, this formulation is very principled and effective based on decades of evidence. However, as with all inference, Bayesian inference would also break under a violation of assumptions - misspecified model likelihood or a misspecified prior, moreso in some problems than others. For instance, it is still a hard problem to come up with a prior for neural networks as we don't have a completely understanding to what that would really mean - does a Gaussian prior over each unit mean something?

A broader family of work, inspired by Bayesian inference, instead avoids using the terms model likelihood and prior altogether. Connecting this back to our original optimization problem, instead of looking p(θ)p(\theta) as a prior, it serves a way for us to represent favorable predictors of the data in the space of distributions P\mathcal{P}. Similarly, instead of p(D  θ)p(\mathcal{D} ~|~ \theta), we want to describe \ell as merely an instrument which guides us towards a better predictive algorithm. q(θ)q(\theta) doesn't really remain a posterior in the faithful Bayesian sense but is oft referred to as a pseudo-posterior. This approach certainly sounds promising, although still in early stages of development [7, 9] for the modern machine learning world.

References

  1. Bishop, C.M., 2006. Pattern recognition and machine learning, springer.
  2. Boyd, S., Boyd, S.P. & Vandenberghe, L., 2004. Convex optimization, Cambridge university press.
  3. Csiszár, I., 1975. I-divergence geometry of probability distributions and minimization problems. The annals of probability, pp.146–158.
  4. Donsker, M.D. & Varadhan, S. SR , 1976. Asymptotic evaluation of certain Markov process expectations for large time—III. Communications on pure and applied Mathematics, 29(4), pp.389–461.
  5. Futami, F., Sato, I. & Sugiyama, M., 2017. Variational inference based on robust divergences. arXiv preprint arXiv:1710.06595.
  6. Guedj, B., 2019. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353.
  7. Knoblauch, J., Jewson, J. & Damoulas, T., 2019. Generalized variational inference. arXiv preprint arXiv:1904.02063.
  8. Schulman, J., Chen, X. & Abbeel, P., 2017. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
  9. Zellner, A., 1988. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4), pp.278–280.

Footnotes


  1. See Appendix D in Bishop [1].