Introduction
In the previous post I showed that it is possible to couple parallel tempered MCMC chains in order to improve mixing. Such methods can be used when the target of interest is a Bayesian posterior distribution that is difficult to sample. There are (at least) a couple of obvious ways that one can temper a Bayesian posterior distribution. Perhaps the most obvious way is a simple flattening, so that if

$$\pi(\theta|y) \propto \pi(\theta)\pi(y|\theta)$$

is the posterior distribution, then for $t\in[0,1]$ we define

$$\pi_t(\theta|y) \propto \pi(\theta|y)^t \propto [\pi(\theta)\pi(y|\theta)]^t.$$

This corresponds with the tempering that is often used in statistical physics applications. We recover the posterior of interest for $t=1$ and tend to a flat distribution as $t\rightarrow 0$. However, for Bayesian posterior distributions, there is a different way of tempering that is often more natural and useful, and that is to temper using the power posterior, defined by

$$\pi_t(\theta|y) \propto \pi(\theta)\pi(y|\theta)^t.$$
Here we again recover the posterior for $t=1$, but get the prior for $t=0$. Thus, the family of distributions forms a natural bridge or path from the prior to the posterior distributions. The power posterior is a special case of the more general concept of a geometric path from distribution $f(\theta)$ (at $t=0$) to $g(\theta)$ (at $t=1$) defined by

$$h_t(\theta) \propto f(\theta)^{1-t}g(\theta)^t,$$

where, in our case, $f(\cdot)$ is the prior and $g(\cdot)$ is the posterior.
So, given a posterior distribution that is difficult to sample, choose a temperature schedule

$$0=t_0<t_1<\cdots<t_{N-1}<t_N=1$$

and run a parallel tempering scheme as outlined in the previous post. The idea is that for small values of $t$ mixing will be good, as prior-like distributions are usually well-behaved, and the mixing of these "high temperature" chains can help to improve the mixing of the "low temperature" chains that are more like the posterior (note that $t$ is really an inverse temperature parameter the way I’ve defined it here…).
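To make the notation concrete, here is a minimal R sketch of a power posterior family for a toy model; the particular prior, likelihood and temperature schedule below are illustrative assumptions of mine, not anything from the previous post.

# Power posterior family for a toy model (all modelling choices here are
# illustrative assumptions)
logPrior = function(theta) dnorm(theta, 0, 10, log=TRUE)
logLik = function(theta, y) sum(dnorm(y, theta, 1, log=TRUE))
# log power posterior: log prior + t * log likelihood
# t=0 gives the prior, t=1 the posterior
logPowerPost = function(theta, y, t) logPrior(theta) + t*logLik(theta, y)
# one possible inverse temperature schedule, 0 = t_0 < ... < t_N = 1
tsched = seq(0, 1, length.out=10)^3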
Marginal likelihood and normalising constants
The marginal likelihood of a Bayesian model is

$$\pi(y) = \int_\Theta \pi(\theta)\pi(y|\theta)\,d\theta.$$
This quantity is of interest for many reasons, including calculation of the Bayes factor between two competing models. Note that this quantity has several different names in different fields. In particular, it is often known as the evidence, due to its role in Bayes factors. It is also worth noting that it is the normalising constant of the Bayesian posterior distribution. Although it is very easy to describe and define, it is notoriously difficult to compute reliably for complex models.
The normalising constant is conceptually very easy to estimate. From the above integral representation, it is clear that

$$\pi(y) = E_{\pi(\theta)}\left[\pi(y|\theta)\right],$$

where the expectation is taken with respect to the prior. So, given samples from the prior, $\theta_1,\theta_2,\ldots,\theta_n$, we can construct the Monte Carlo estimate

$$\widehat{\pi}(y) = \frac{1}{n}\sum_{i=1}^n \pi(y|\theta_i),$$

and this will be a consistent estimator of the true evidence under fairly mild regularity conditions. Unfortunately, in practice it is likely to be a very poor estimator if the posterior and prior are not very similar. Now, we could also use Bayes theorem to re-write the integral as an expectation with respect to the posterior, so we could then use samples from the posterior to estimate the evidence. This leads to the harmonic mean estimator of the evidence, which has been described as the worst Monte Carlo method ever! Now it turns out that there are many different ways one can construct estimators of the evidence using samples from the prior and the posterior, some of which are considerably better than the two I’ve outlined. This is the subject of the bridge sampling paper of Meng and Wong. However, the reality is that no method will work well if the prior and posterior are very different.
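To see the problem concretely, here is a small R sketch comparing the prior-sample estimator and the harmonic mean estimator against the exact answer for a toy conjugate normal model; the model, data and sample sizes are my own illustrative choices, not anything from the post.

# Toy conjugate model (an illustrative assumption): y_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(1)
n = 20; y = rnorm(n, 2, 1)
loglik = function(theta) sapply(theta, function(th) sum(dnorm(y, th, 1, log=TRUE)))
# exact log evidence via the identity pi(y) = pi(theta)pi(y|theta)/pi(theta|y), at theta=0
postVar = 1/(n+1); postMean = postVar*sum(y)
exact = loglik(0) + dnorm(0, 0, 1, log=TRUE) - dnorm(0, postMean, sqrt(postVar), log=TRUE)
# naive estimator: average the likelihood over samples from the prior
thPrior = rnorm(1e5, 0, 1)
naive = log(mean(exp(loglik(thPrior))))
# harmonic mean estimator: average the reciprocal likelihood over posterior samples
thPost = rnorm(1e5, postMean, sqrt(postVar))
harmonic = -log(mean(exp(-loglik(thPost))))
c(exact=exact, naive=naive, harmonic=harmonic)

Both Monte Carlo estimates above use a very large number of samples, yet they can still be noisy and biased in practice, precisely because the prior and posterior are quite different here.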
If we have tempered chains, then we have a sequence of chains targeting distributions which, by construction, are not too different, and so we can use the output from tempered chains in order to construct estimators of the evidence that are more numerically stable. If we call the evidence of the $i$th chain $z_i$, so that $z_0=1$ and $z_N=\pi(y)$, then we can write the evidence in telescoping fashion as

$$\pi(y) = z_N = \frac{z_N}{z_0} = \frac{z_1}{z_0}\times\frac{z_2}{z_1}\times\cdots\times\frac{z_N}{z_{N-1}}.$$

Now the $i$th term in this product is $z_{i+1}/z_i$, which can be estimated using the output from the $i$th and/or $(i+1)$th chain(s). Again, this can be done in a variety of ways, using your favourite bridge sampling estimator, but the point is that the estimator should be reasonably good due to the fact that the $i$th and $(i+1)$th targets are very similar. For the power posterior, the simplest method is to write

$$\frac{z_{i+1}}{z_i} = \frac{\int_\Theta \pi(\theta)\pi(y|\theta)^{t_{i+1}}\,d\theta}{z_i}
= \int_\Theta \pi(y|\theta)^{t_{i+1}-t_i}\,\frac{\pi(\theta)\pi(y|\theta)^{t_i}}{z_i}\,d\theta
= E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right],$$

where the expectation is with respect to the $i$th target, and hence can be estimated in the usual way using samples from the $i$th chain.
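In code, each ratio is just a sample average of the likelihood raised to the power of the temperature increment. A minimal R sketch, assuming the log-likelihood evaluated at the samples of chain $i$ is available as a vector (the names below are my own, not from the post):

# Sketch: estimate z_{i+1}/z_i from the output of the i-th power-posterior chain.
# Assumed inputs: ll_i, a vector of log pi(y|theta) values at the samples of chain i,
# and the inverse temperatures ti = t_i and ti1 = t_{i+1}
ratioEstimate = function(ll_i, ti, ti1) mean(exp((ti1 - ti)*ll_i))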
For numerical stability, in practice we compute the log of the evidence as

$$\log \pi(y) = \sum_{i=0}^{N-1} \log E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right]
= \sum_{i=0}^{N-1} \log E_i\left[\exp\{(t_{i+1}-t_i)\log \pi(y|\theta)\}\right].$$

The above expression is exact, and is the obvious formula to use for computation. However, it is clear that if $t_i$ and $t_{i+1}$ are sufficiently close, it will be approximately OK to switch the expectation and exponential, giving

$$\log \pi(y) \simeq \sum_{i=0}^{N-1} (t_{i+1}-t_i)\, E_i\left[\log \pi(y|\theta)\right].$$

In the continuous limit, this gives rise to the well-known path sampling identity,

$$\log \pi(y) = \int_0^1 E_t\left[\log \pi(y|\theta)\right]\,dt.$$
So, an alternative approach to computing the evidence is to use the samples to approximately numerically integrate the above integral, say, using the trapezium rule. However, it isn’t completely clear (to me) that this is better than using the exact expression directly, since there is then no numerical integration error to worry about.
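Putting the pieces together, here is a sketch of both the exact sum-of-log-ratios estimator and the trapezium-rule (path sampling) approximation, assuming the tempered output has been arranged as a matrix ll with one column of log-likelihood values per chain, ordered to match the schedule temps (again, these names and this layout are my own assumptions):

# Assumed inputs: temps = c(t_0=0, ..., t_N=1) and ll, a matrix whose j-th column
# holds log pi(y|theta) evaluated at the samples of the chain with temperature temps[j]
logEvidenceExact = function(temps, ll) {
  d = diff(temps)                                          # increments t_{i+1} - t_i
  sum(sapply(seq_along(d), function(i)
    log(mean(exp(d[i]*ll[, i])))))                         # sum_i log E_i[lik^{d_i}]
}
logEvidencePath = function(temps, ll) {
  Elog = colMeans(ll)                                      # estimates of E_t[log pi(y|theta)]
  sum(diff(temps)*(head(Elog, -1) + tail(Elog, -1))/2)     # trapezium rule over the schedule
}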
Numerical illustration
We can illustrate these ideas using the simple double potential well example from the previous post. Now that example doesn’t really correspond to a Bayesian posterior, and is tempered directly, rather than as a power posterior, but essentially the same ideas follow for general parallel tempered distributions. In general, we can use the sample to estimate the ratio of the last and first normalising constants, $z_N/z_0$. Here it isn’t obvious why we’d want to know that, but we’ll compute it anyway to illustrate the method. As before, we expand as a telescopic product, where the $i$th term is now

$$\frac{z_{i+1}}{z_i} = E_i\left[\exp\{-(\gamma_{i+1}-\gamma_i)(x^2-1)^2\}\right].$$

A Monte Carlo estimate of each of these terms is formed using the samples from the $i$th chain, and the logs of these are then summed to give $\log(z_N/z_0)$. A complete R script to run the Metropolis coupled sampler and compute the evidence is given below.
U=function(gam,x) { gam*(x*x-1)*(x*x-1) }   # double well potential with inverse temperature gam
temps=2^(0:3)                               # inverse temperature ladder
iters=1e5

chains=function(pot=U, tune=0.1, init=1)
{
  x=rep(init,length(temps))
  xmat=matrix(0,iters,length(temps))
  for (i in 1:iters) {
    # Metropolis update for each tempered chain
    can=x+rnorm(length(temps),0,tune)
    logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
    accept=(log(runif(length(temps)))<logA)
    x[accept]=can[accept]
    # propose a swap of states between a random pair of chains
    swap=sample(1:length(temps),2)
    logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
         pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
    if (log(runif(1))<logA) x[swap]=rev(x[swap])
    xmat[i,]=x
  }
  colnames(xmat)=paste("gamma=",temps,sep="")
  xmat
}

mat=chains()
# estimate each ratio z_{i+1}/z_i = E_i[exp{-(gamma_{i+1}-gamma_i)(x^2-1)^2}]
mat=mat[,1:(length(temps)-1)]   # the last chain is not needed for the estimator
diffs=diff(temps)
mat=(mat*mat-1)^2
mat=-t(diffs*t(mat))
mat=exp(mat)
logEvidence=sum(log(colMeans(mat)))
message(paste("The log of the ratio of the last and first normalising constants is",
              logEvidence))
It turns out that these double well potential densities are tractable, and so the normalising constants can be computed exactly. So, with a little help from Wolfram Alpha, I compute the log of the ratio of the last and first normalising constants to be approximately -1.12. Hopefully the above script will output something a bit like that…
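As a quick sanity check, the same quantity can also be obtained by direct numerical quadrature of the two unnormalised densities, for example using base R’s integrate (a small sketch, not part of the original script):

# Direct numerical check of log(z_N/z_0) for the double well densities
U = function(gam, x) gam*(x*x-1)*(x*x-1)
z = function(gam) integrate(function(x) exp(-U(gam, x)), -Inf, Inf)$value
log(z(8)/z(1))   # should be roughly -1.12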
References
- Meng, Xiao-Li, and Wing Hung Wong. “Simulating ratios of normalizing constants via a simple identity: a theoretical exploration.” Statistica Sinica 6.4 (1996): 831-860.
- Gelman, Andrew, and Xiao-Li Meng. “Simulating normalizing constants: From importance sampling to bridge sampling to path sampling.” Statistical Science (1998): 163-185.
- Friel, Nial, and Anthony N. Pettitt. “Marginal likelihood estimation via power posteriors.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.3 (2008): 589-607.
- Friel, Nial, and Jason Wyse. “Estimating the evidence–a review.” Statistica Neerlandica 66.3 (2012): 288-308.
