The Wishart distribution and especially the inverse-Wishart distribution are the source of some confusion because they occasionally appear with alternative parameterizations. Also, the Wishart distribution can be used to model a covariance matrix or a precision matrix (the inverse of a covariance matrix) in different situations, and the inverse-Wishart the same, but the other way round. It’s already becoming complicated. Hal Stern, coauthor of Bayesian Data Analysis (BDA), helped to clarify many issues for me in an email conversation. In this post I hope to clarify the differences in Wishart parameterizations of BDA, the wikipedia pages, and the WinBUGS and OpenBUGS softwares, and show an example in OpenBUGS where the inverse parameterization has to be specified relative to the distribution’s definition for the correct posterior to result.

The Wishart distribution commonly arises in the context of the Bayesian multivariate normal model, so first I detail that model. Let $$d$$-dimensional vector $$y_i$$, $$i=1,\ldots,n$$, be independent with $$y_i|\mu,\Sigma\sim\textrm{Normal}(\mu,\Sigma)$$, where $$\mu$$ is a (column) vector of length $$d$$ and $$\Sigma$$ is a $$d\times d$$ covariance matrix, which is symmetric and positive definite. The conjugate priors for the parameters are $$\mu|\mu_0,\Sigma,\kappa_0\sim\textrm{Normal}(\mu_0,\Sigma/\kappa_0)$$, where $$\mu_0$$ is the prior mean and $$\kappa_0$$ is the number of prior measurements on the $$\Sigma$$ scale, and $$\Sigma|\Sigma_0,\nu_0\sim\textrm{Inv-Wishart}(\Sigma_0,\nu_0)$$, where $$\Sigma_0$$ is the prior covariance matrix with $$\nu_0$$ degrees of freedom.

I’ll follow the definition of the Wishart and inverse-Wishart distributions used in BDA and Wikipedia. $$W|A,\nu\sim\textrm{Wishart}(A,\nu)$$ if $$p(W|A,\nu)\propto|A|^{-\nu/2}|W|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(A^{-1}W)\right)$$, with expected value $$\textrm{E}(W)=\nu A$$ (notation: $$\textrm{tr}(X^{T})$$ indicates the trace of the transpose of matrix $$X$$) . The integral is finite when $$\nu\ge d$$ and the density is finite if $$\nu\ge d+1$$. In this definition $$A$$ and $$W$$ are on the same scale. $$A$$ and $$W$$ are both covariance matrices when modeling a sample covariance matrix. Specifically, the scatter matrix $$S$$, where $$S=\sum_{i=1}^n(y_i-\bar{y})(y_i-\bar{y})^{T}$$, has $$S|V,n\sim\textrm{Wishart}(V,n)$$, where $$V$$ is the population covariance and $$n$$ is the sample size informing $$S$$. $$A$$ and $$W$$ are both precision matrices when modeling the inverse covariance (precision) matrix ($$\tau\equiv\Sigma^{-1}$$) of the normal model above. That is $$\tau|\tau_0,\nu_0\sim\textrm{Wishart}(\tau_0,\nu_0)$$, where $$\tau_0$$ is the inverse of the prior population covariance and $$\nu_0$$ can be thought of as the prior sample size. If $$B^{-1}|A,\nu\sim\textrm{Wishart}(A,\nu)$$ (where $$B$$ is a covariance matrix and $$B^{-1}$$ and $$A$$ are precision matrices) then $$B|A^{-1},\nu\sim\textrm{Inv-Wishart}(A^{-1},\nu)$$. Now, with $$W$$ and $$A$$ as covariance matricies, $$W|A,\nu\sim\textrm{Inv-Wishart}(A,\nu)$$ if $$p(W|A,\nu)\propto|A|^{\nu/2}|W|^{-(\nu+d+1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(AW^{-1})\right)$$, with expected value $$\textrm{E}(W)=A/(\nu-d-1)$$.

One point of confusion was when using WinBUGS and OpenBUGS to fit the Bayesian multivariate normal model. I’ll call the WinBUGS parameterization of the Wishart the “BUGS-Wishart” with a covariance parameter ($$R$$) and a precision random matrix ($$W$$) (these are on inverse scales from one another, where our standard definition above had the parameter and random matrix on the same scale). $$W|R,\nu\sim\textrm{BUGS-Wishart}$$ if $$p(W|R,\nu)\propto|R|^{\nu/2}|W|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(RW)\right)$$, with expected value $$\textrm{E}(W)=\nu R^{-1}$$. By itself, this was straightforward for the parameter to be on the covariance scale rather than the precision scale, however it becomes misleading when incorporated into the normal model. For example, I expected a non-informative prior to have covariance matrix $$R$$ with large diagonal elements. But the following example shows the need for a Bayesian to write down the full posterior to see how the prior and likelihoods combine in the posterior.

The second point of confusion was in the prior distributional specification for precision matrix $$\tau$$ in the OpenBUGS Camel example. This example has replicated structured missing bivariate data at $$\pm 2$$, but four complete-data points at the corners of a square ($$\pm 1$$) centered at zero. The camel example is named so because the posterior for $$\rho$$ (rho) has two humps; two of the complete observations support correlation $$+1$$ and two support correlation $$-1$$. The data has a normal distribution parameterized with precision matrix, $$\tau$$, as $$y_i\sim\textrm{Normal}([0,0]^{T},\tau)$$, and precision matrix $$\tau\sim\textrm{Wishart}(R,2)$$, where $$R=\textrm{diag}(0.001,0.001)$$. The correlation $$\rho$$ is calculated from $$\tau$$ by inverting $$\tau$$ to get the covariance matrix $$\Sigma$$ and taking the off-diagonal covariance divided by the square root of the product of the on-diagonal variances. Because the BUGS-Wishart distribution has a covariance parameter $$R$$, I was surprised at their use of a noninformative prior parameter matrix with diagonal values of 0.001 (instead of variance values of 1000). Hal Stern helped clarify why the prior they use is the correct one by writing out the posterior distribution based on the BUGS-Wishart density and the normal likelihood (which is parameterized in terms of precision $$\tau$$). Looking at the exponential kernel of the BUGS-Wishart distribution, we have $$\exp\left(-\frac{1}{2}\textrm{tr}(R\tau)\right)$$. The kernel of the likelihood is $$\exp\left(-\frac{1}{2}\sum_{i=1}^n(y_i-\mu)^{T}\tau(y_i-\mu)\right)=\exp\left(-\frac{1}{2}\textrm{tr}(S_0\tau)\right)$$, where $$S_0=\sum_{i=1}^n(y_i-\mu)(y_i-\mu)^{T}$$ is the “sums of squares” relative to $$\mu$$. Thus, when the prior and likelihood are combined in the posterior, the posterior for $$\tau$$ is now $$\exp\left(-\frac{1}{2}\textrm{tr}((R+S_0)\tau)\right)$$, which is BUGS-Wishart with parameter $$R+S_0$$. This is the correct noninformative analysis since the near-zero matrix $$R$$ from the prior contributes little to the sum with $$S_0$$ informed by the data. However, it requires looking ahead to the form of the posterior to see that the parameter for the BUGS-Wishart needs to be specified as a precision matrix INSTEAD of a covariance matrix! Dave Lunn (first author of the reference for WinBUGS) has written to me that he will try to clarify matters in the next version. *I would like to see a second version of the Wishart implemented in WinBUGS and OpenBUGS with the parameter and the random matrix on the same scale, as in BDA, to avoid this awkward inverse specification of the parameter for the BUGS-Wishart distribution.*