Home > Statistics > Wishart distribution in WinBUGS, nonstandard parameterization

Wishart distribution in WinBUGS, nonstandard parameterization

The Wishart distribution and especially the inverse-Wishart distribution are the source of some confusion because they occasionally appear with alternative parameterizations. Also, the Wishart distribution can be used to model a covariance matrix or a precision matrix (the inverse of a covariance matrix) in different situations, and the inverse-Wishart the same, but the other way round. It’s already becoming complicated. Hal Stern, coauthor of Bayesian Data Analysis (BDA), helped to clarify many issues for me in an email conversation. In this post I hope to clarify the differences in Wishart parameterizations of BDA, the wikipedia pages, and the WinBUGS and OpenBUGS softwares, and show an example in OpenBUGS where the inverse parameterization has to be specified relative to the distribution’s definition for the correct posterior to result.

The Wishart distribution commonly arises in the context of the Bayesian multivariate normal model, so first I detail that model. Let d-dimensional vector y_i, i=1,\ldots,n, be independent with y_i|\mu,\Sigma\sim\textrm{Normal}(\mu,\Sigma), where \mu is a (column) vector of length d and \Sigma is a d\times d covariance matrix, which is symmetric and positive definite. The conjugate priors for the parameters are \mu|\mu_0,\Sigma,\kappa_0\sim\textrm{Normal}(\mu_0,\Sigma/\kappa_0), where \mu_0 is the prior mean and \kappa_0 is the number of prior measurements on the \Sigma scale, and \Sigma|\Sigma_0,\nu_0\sim\textrm{Inv-Wishart}(\Sigma_0,\nu_0), where \Sigma_0 is the prior covariance matrix with \nu_0 degrees of freedom.

I’ll follow the definition of the Wishart and inverse-Wishart distributions used in BDA and Wikipedia. W|A,\nu\sim\textrm{Wishart}(A,\nu) if p(W|A,\nu)\propto|A|^{-\nu/2}|W|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(A^{-1}W)\right), with expected value \textrm{E}(W)=\nu A (notation: \textrm{tr}(X^{T}) indicates the trace of the transpose of matrix X) . The integral is finite when \nu\ge d and the density is finite if \nu\ge d+1. In this definition A and W are on the same scale. A and W are both covariance matrices when modeling a sample covariance matrix. Specifically, the scatter matrix S, where S=\sum_{i=1}^n(y_i-\bar{y})(y_i-\bar{y})^{T}, has S|V,n\sim\textrm{Wishart}(V,n), where V is the population covariance and n is the sample size informing S. A and W are both precision matrices when modeling the inverse covariance (precision) matrix (\tau\equiv\Sigma^{-1}) of the normal model above. That is \tau|\tau_0,\nu_0\sim\textrm{Wishart}(\tau_0,\nu_0), where \tau_0 is the inverse of the prior population covariance and \nu_0 can be thought of as the prior sample size. If B^{-1}|A,\nu\sim\textrm{Wishart}(A,\nu) (where B is a covariance matrix and B^{-1} and A are precision matrices) then B|A^{-1},\nu\sim\textrm{Inv-Wishart}(A^{-1},\nu).  Now, with W and A as covariance matricies, W|A,\nu\sim\textrm{Inv-Wishart}(A,\nu) if p(W|A,\nu)\propto|A|^{\nu/2}|W|^{-(\nu+d+1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(AW^{-1})\right), with expected value \textrm{E}(W)=A/(\nu-d-1).

One point of confusion was when using WinBUGS and OpenBUGS to fit the Bayesian multivariate normal model. I’ll call the WinBUGS parameterization of the Wishart the “BUGS-Wishart” with a covariance parameter (R) and a precision random matrix (W) (these are on inverse scales from one another, where our standard definition above had the parameter and random matrix on the same scale). W|R,\nu\sim\textrm{BUGS-Wishart} if p(W|R,\nu)\propto|R|^{\nu/2}|W|^{(\nu-d-1)/2}\exp\left(-\frac{1}{2}\textrm{tr}(RW)\right), with expected value \textrm{E}(W)=\nu R^{-1}. By itself, this was straightforward for the parameter to be on the covariance scale rather than the precision scale, however it becomes misleading when incorporated into the normal model. For example, I expected a non-informative prior to have covariance matrix R with large diagonal elements. But the following example shows the need for a Bayesian to write down the full posterior to see how the prior and likelihoods combine in the posterior.

The second point of confusion was in the prior distributional specification for precision matrix \tau in the OpenBUGS Camel example. This example has replicated structured missing bivariate data at \pm 2, but four complete-data points at the corners of a square (\pm 1) centered at zero. The camel example is named so because the posterior for \rho (rho) has two humps; two of the complete observations support correlation +1 and two support correlation -1. The data has a normal distribution parameterized with precision matrix, \tau, as y_i\sim\textrm{Normal}([0,0]^{T},\tau), and precision matrix \tau\sim\textrm{Wishart}(R,2), where R=\textrm{diag}(0.001,0.001). The correlation \rho is calculated from \tau by inverting \tau to get the covariance matrix \Sigma and taking the off-diagonal covariance divided by the square root of the product of the on-diagonal variances. Because the BUGS-Wishart distribution has a covariance parameter R, I was surprised at their use of a noninformative prior parameter matrix with diagonal values of 0.001 (instead of variance values of 1000). Hal Stern helped clarify why the prior they use is the correct one by writing out the posterior distribution based on the BUGS-Wishart density and the normal likelihood (which is parameterized in terms of precision \tau). Looking at the exponential kernel of the BUGS-Wishart distribution, we have \exp\left(-\frac{1}{2}\textrm{tr}(R\tau)\right). The kernel of the likelihood is \exp\left(-\frac{1}{2}\sum_{i=1}^n(y_i-\mu)^{T}\tau(y_i-\mu)\right)=\exp\left(-\frac{1}{2}\textrm{tr}(S_0\tau)\right), where S_0=\sum_{i=1}^n(y_i-\mu)(y_i-\mu)^{T} is the “sums of squares” relative to \mu. Thus, when the prior and likelihood are combined in the posterior, the posterior for \tau is now \exp\left(-\frac{1}{2}\textrm{tr}((R+S_0)\tau)\right), which is BUGS-Wishart with parameter R+S_0. This is the correct noninformative analysis since the near-zero matrix R from the prior contributes little to the sum with S_0 informed by the data. However, it requires looking ahead to the form of the posterior to see that the parameter for the BUGS-Wishart needs to be specified as a precision matrix INSTEAD of a covariance matrix! Dave Lunn (first author of the reference for WinBUGS) has written to me that he will try to clarify matters in the next version. I would like to see a second version of the Wishart implemented in WinBUGS and OpenBUGS with the parameter and the random matrix on the same scale, as in BDA, to avoid this awkward inverse specification of the parameter for the BUGS-Wishart distribution.

Statistics

  1. Alok Saboo
    November 11th, 2010 at 21:13 | #1

    I have a quick question about the covariance matrix and the Wishart function in BUGS. Is there a way to constraint the value of the covariance matrix generated from the wishart function, e.g., we may want to specify two variables as uncorrelated (i.e., 0 in the covariance matrix) or we do not want to free estimate the variable of certain variable and specify it to be 1 (for identification purposes). Thanks!!

  2. Bob Carlton
    February 24th, 2011 at 10:51 | #2

    I was also puzzled at first about the camel example -> but reached a different conclusion. The problem with the camel example is that the degree of freedom parameter (v) of the wishart is not >= dim(matrix)+2, therefore this prior is not an integrateable density. This means that E(v,matrix) is not interpretable in the usual way. One can easily see this if you just draw from the wishart in winbugs setting the degree of freedom parameter to 4 (not using it as a prior, just drawing from the distribution), then everything behaves as expected. Stating this in a more convenient Inverse Wishart as a conjugate to a normal way:
    You want E=Inv-Wishart(v,V) with both E and V in Covariance Form
    then prec.E=Bugs-Wishart(v,V) and v>=dim(V)+2 and V=covariance matrix
    results in prec.E equals the precision matrix for the normal and Inverse[prec.E] is the covariance matrix. The camel example can violates this because v not >= dim(V)+2, so it is not a good example for the general interpretation and general cases.
    mean sd MC_error val2.5pc median val97.5pc start sample
    T[1,1] 0.001262 0.028 3.075E-4 1.053E-4 4.268E-4 0.004854 1001 8400
    T[1,2] -7.499E-5 0.006116 6.748E-5 -0.001702 3.507E-6 0.001648 1001 8400
    T[2,1] -7.499E-5 0.006116 6.748E-5 -0.001702 3.507E-6 0.001648 1001 8400
    T[2,2] 0.001042 0.007392 7.701E-5 1.081E-4 4.236E-4 0.004577 1001 8400
    prec.T[1,1] 3981.0 2849.0 37.23 473.9 3320.0 11350.0 1001 8400
    prec.T[1,2] -15.8 1970.0 20.64 -4085.0 -36.18 4077.0 1001 8400
    prec.T[2,1] -15.8 1970.0 20.64 -4085.0 -36.18 4077.0 1001 8400
    prec.T[2,2] 3973.0 2810.0 26.06 517.2 3324.0 11130.0 1001 8400

  1. No trackbacks yet.