Figure 2.3. Posterior distribution for the probability of occurrence assuming a uniform prior. Vertical line indicates the posterior mode. [Axis label: probability of occurrence.]

credible interval is the result of a probability calculation and reflects our posterior belief in the probable range of $\theta$ values given the evidence in the observed data. Thus, we might choose to summarize the analysis by saying, "the probability that $\theta$ lies in the interval [0.04, 0.64] is 0.95." In contrast, the probability statement associated with a confidence interval corresponds to the proportion of confidence intervals that contain the fixed, but unknown, value of $\theta$ in an infinite sequence of hypothetical, repeated samples (see Section 2.3.2). The frequentist's interval therefore requires considerably more explanation and is far from a direct statement of probability. Unfortunately, the difference in interpretation of credible intervals and confidence intervals is often ignored in practice, much to the consternation of many statisticians.

Earlier we mentioned that one of the virtues of Bayesian inference is that probability statements about $\theta$ are exact for any sample size. This is especially meaningful when one considers that a Bayesian analysis yields the entire posterior pdf of $\theta$, $\pi(\theta \mid y)$, as opposed to a single point estimate of $\theta$. Therefore, in addition to computing summaries of the posterior, such as its mean $E(\theta \mid y)$ or variance $\mathrm{Var}(\theta \mid y)$, any function of $\theta$ can be calculated while accounting for all of the posterior uncertainty in $\theta$. The benefits of being able to manage errors in estimation in this way are especially evident in computing inferences for latent parameters of hierarchical models, as we will illustrate in Section 2.6, or in computing predictions that depend on the estimated value of $\theta$.
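To make the idea of computing a function of the parameter concrete, here is a minimal sketch that propagates posterior uncertainty through the odds, g(theta) = theta / (1 - theta). It assumes the Be(11, 41) posterior that arises from the occurrence example later in this section (n = 50 sites, v = 10 occupied, uniform prior). The function names, the number of draws, and the seed are illustrative choices, not part of the original analysis.

```python
import random

# Assumed posterior: Be(1 + v, 1 + n - v) with n = 50 and v = 10.
A, B = 11, 41


def posterior_summary_of(g, a=A, b=B, draws=20000, seed=1):
    """Monte Carlo posterior mean and 95% equal-tailed interval of g(theta)."""
    rng = random.Random(seed)
    # Draw theta from the posterior and transform each draw with g.
    values = sorted(g(rng.betavariate(a, b)) for _ in range(draws))
    mean = sum(values) / draws
    lower = values[int(0.025 * draws)]
    upper = values[int(0.975 * draws) - 1]
    return mean, lower, upper


def odds(theta):
    """Odds of occurrence; a hypothetical function of interest."""
    return theta / (1.0 - theta)


mean_odds, lo_odds, hi_odds = posterior_summary_of(odds)
print(f"posterior mean of odds: {mean_odds:.3f}")
print(f"95% interval for odds:  [{lo_odds:.3f}, {hi_odds:.3f}]")
```

Because each posterior draw is transformed directly, the interval for the odds reflects the full posterior uncertainty in the parameter rather than a plug-in estimate. For a Be(a, b) posterior with b > 1 the exact posterior mean of the odds is a / (b - 1), which gives a simple check on the simulation.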

Specification of the prior distribution may be perceived as a benefit or as a disadvantage of the Bayesian mode of inference. In scientific problems where prior information about $\theta$ may exist or can be elicited (say, from expert opinion), Bayes' theorem reveals precisely how such information may be used when computing inferences for $\theta$. In other (or perhaps most) scientific problems, little may be known about the probable magnitude of $\theta$ in advance of an experiment or survey. In these cases an objective approach would be to use a prior that places equal (or nearly equal) probability on all values of $\theta$. Such priors are often called 'vague' or 'non-informative.' A problem with this approach is that priors are not invariant to transformation of the parameters. In other words, a prior that is 'non-informative' for $\theta$ can be quite informative for $g(\theta)$, a one-to-one transformation of $\theta$.
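As a small worked illustration of this lack of invariance (the logit transformation is chosen here purely for illustration), suppose $\theta \sim \mathrm{U}(0, 1)$ and let $\phi = g(\theta) = \log\{\theta / (1 - \theta)\}$, so that $\theta = e^{\phi} / (1 + e^{\phi})$. The change-of-variables formula gives the induced prior on $\phi$:

$$\pi_{\phi}(\phi) = \pi_{\theta}\big(\theta(\phi)\big) \left| \frac{d\theta}{d\phi} \right| = \frac{e^{\phi}}{(1 + e^{\phi})^{2}}, \qquad -\infty < \phi < \infty.$$

This density is far from flat: it peaks at $\phi = 0$ and assigns probability of about 0.905 to the interval $[-3, 3]$, so a prior that is uniform for $\theta$ is clearly informative on the logit scale.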

One solution to this problem is to develop a prior that is both non-informative and invariant to transformation of its parameters. A variety of such 'objective priors', as they are currently called (see Chapter 5 of Ghosh et al. (2006)), have been developed for models with relatively few parameters. Objective priors are often improper (that is, $\int \pi(\theta)\,d\theta = \infty$); therefore, if an objective prior is to be used, the analyst must prove that the resulting posterior distribution is proper (that is, $\int f(y \mid \theta)\,\pi(\theta)\,d\theta < \infty$). Such proofs often require considerable mathematical expertise, particularly for models that contain many parameters.

A second solution to the problem of constructing a non-informative prior is to identify a particular parameterization of the model for which a uniform (or nearly uniform) prior makes sense. Of course, this approach is possible only if we are able to assign scientific relevance and context to the model's parameters. We have found this approach to be useful in the analysis of ecological data, and we use this approach throughout the book.

Specification of the prior distribution can be viewed as the 'price' paid for the exactness of inferences computed using Bayes' theorem. When the sample size is low, the price of an exact inference may be high. As the size of a sample increases, the price of an exact inference declines because the information in the data eventually exceeds the information in the prior. We will return to this tradeoff in the next section, where we describe some asymptotic properties of posteriors.

2.4.3 Asymptotic Properties of the Posterior Distribution

We have noted already that the Bayesian approach to model-based inference has several appealing characteristics. In this section we describe additional features that are associated with computing inferences from large samples.

Figure 2.4. A normal approximation (dashed line) of the posterior distribution of the probability of occurrence (solid line). Vertical line indicates the posterior mode. [Axis label: probability of occurrence.]

Let $[\theta \mid y]$ denote the posterior distribution of $\theta$ given an observed set of data $y$. If a set of 'regularity conditions' that have to do with technical details, such as identifiability of the model's parameters and differentiability of the posterior density function $\pi(\theta \mid y)$, are satisfied, we can prove that as sample size $n \to \infty$,

$$(\theta - \hat{\theta}) \mid y \sim N\!\left(0, [I(\hat{\theta})]^{-1}\right) \qquad (2.4.2)$$

where $\hat{\theta}$ is the posterior mode and $I(\hat{\theta}) = -\left.\dfrac{\partial^{2} \log \pi(\theta \mid y)}{\partial\theta\,\partial\theta^{\top}}\right|_{\theta = \hat{\theta}}$ is called the generalized observed information (Ghosh et al., 2006). The practical utility of this limiting behavior is that the posterior distribution of $\theta$ can be approximated by a normal distribution $N(\hat{\theta}, [I(\hat{\theta})]^{-1})$ if $n$ is sufficiently large. In other words, when $n$ is large, we can expect the posterior to become highly concentrated around the posterior mode $\hat{\theta}$.

Example: estimating the probability of occurrence

Recall from Section 2.4.1.1 that the posterior mode for the probability of occurrence was $\hat{\theta} = v/n$ when a Be(1,1) prior was assumed for $\theta$. It is easily proved that $[I(\hat{\theta})]^{-1} = \hat{\theta}(1 - \hat{\theta})/n$ given this choice of prior; therefore, according to Eq. (2.4.2) we can expect a $N(\hat{\theta}, \hat{\theta}(1 - \hat{\theta})/n)$ distribution to approximate the true posterior, a Be$(1 + v, 1 + n - v)$ distribution, when $n$ is sufficiently large. Figure 2.4 illustrates that the approximation holds very well for a sample of $n = 50$ locations, of which $v = 10$ are occupied.
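The comparison shown in Figure 2.4 can be reproduced numerically. The sketch below evaluates the exact Be(1 + v, 1 + n - v) posterior density and its normal approximation at a few points for n = 50 and v = 10. The helper functions and the grid of evaluation points are illustrative assumptions; only the Python standard library is used.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Be(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def normal_pdf(x, mean, var):
    """Density of a N(mean, var) distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


n, v = 50, 10                      # sample size and number of occupied locations
mode = v / n                       # posterior mode under the Be(1, 1) prior
approx_var = mode * (1 - mode) / n  # inverse of the observed information

for x in (0.10, 0.15, 0.20, 0.25, 0.30):  # hypothetical evaluation grid
    exact = beta_pdf(x, 1 + v, 1 + n - v)
    approx = normal_pdf(x, mode, approx_var)
    print(f"x = {x:.2f}  exact = {exact:.3f}  normal approx = {approx:.3f}")
```

The two densities agree closely near the mode and differ more in the tails, where the skewness of the beta posterior is not captured by the symmetric normal curve.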

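The asymptotic normality of the posterior (indicated in Eq. (2.4.2)) is an important result because it establishes formally that the relative importance of the prior distribution must decrease with an increase in sample size. To see this, note that $I(\theta)$ is the sum of two components, one due to the likelihood function $f(y \mid \theta)$ and another due to the prior density $\pi(\theta)$:

$$I(\theta) = -\frac{\partial^{2} \log f(y \mid \theta)}{\partial\theta\,\partial\theta^{\top}} - \frac{\partial^{2} \log \pi(\theta)}{\partial\theta\,\partial\theta^{\top}} \qquad (2.4.3)$$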

As n increases, only the magnitude of the first term on the right-hand side of Eq. (2.4.3) increases, whereas the magnitude of the second term, which quantifies the information in the prior, remains constant. An important consequence of this result is that we can expect inferences to be insensitive to the choice of prior if we have enough data. On the other hand, if the sample size is relatively small, the prior distribution may be a critical part of the model specification.
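The two terms of Eq. (2.4.3) can be computed explicitly for the binomial example. The sketch below assumes a hypothetical informative Be(5, 5) prior (under the Be(1, 1) prior used earlier the prior term is exactly zero), holds the observed proportion at v/n = 0.2, and evaluates both terms at 0.2 for increasing n. The prior and the sample sizes are illustrative choices only.

```python
def likelihood_information(theta, n, v):
    """Negative second derivative of the binomial log-likelihood."""
    return v / theta**2 + (n - v) / (1 - theta) ** 2


def prior_information(theta, a, b):
    """Negative second derivative of the log Be(a, b) prior density."""
    return (a - 1) / theta**2 + (b - 1) / (1 - theta) ** 2


theta = 0.2   # point at which both terms are evaluated
a, b = 5, 5   # hypothetical informative prior

rows = []
for n in (5, 50, 500):
    v = n // 5  # keeps the observed proportion v/n fixed at 0.2
    lik = likelihood_information(theta, n, v)
    pri = prior_information(theta, a, b)
    rows.append((n, lik, pri, pri / (lik + pri)))

for n, lik, pri, share in rows:
    print(f"n = {n:3d}  likelihood term = {lik:8.2f}  "
          f"prior term = {pri:6.2f}  prior share = {share:.3f}")
```

The likelihood term grows in proportion to n while the prior term does not change, so the prior's share of the total information shrinks toward zero as the sample size increases.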

Thus far, we have illustrated the Bayesian approach to inference using a rather simple model (binomial likelihood and conjugate prior), where the posterior density function $\pi(\theta \mid y)$ could be expressed in closed form. However, in many (perhaps most) cases of scientific interest, an approximating model of the data will be more complex, often involving many parameters and multiple levels of parameters. In such cases it is often difficult or impossible to calculate the normalizing constant $m(y)$ accurately because the calculation requires a $p$-dimensional integration if the model contains $p$ distinct parameters. Therefore, the posterior density is often known only up to a constant of proportionality: $\pi(\theta \mid y) \propto f(y \mid \theta)\,\pi(\theta)$.

This computational impediment is one of the primary reasons why the Bayesian approach to inference was not widely used prior to the late 20th century. Opposition by frequentists, many of whom strongly advocated classical inference procedures, is another reason. In 1774, Pierre Simon Laplace developed a method for computing a large-sample approximation of the normalizing constant $m(y)$, but this procedure is applicable only in cases where the unnormalized posterior density is a smooth function of $\theta$ with a sharp maximum at the posterior mode (Laplace, 1986). Refinements of Laplace's method have been developed (Ghosh et al., 2006, Section 4.3.2), but these refinements, as with Laplace's method, lack generality in their range of application.

A recent upsurge in the use of Bayesian inference procedures can be attributed to the widespread availability of fast computers and to the development of efficient algorithms, known collectively as Markov chain Monte Carlo (MCMC) samplers. These include the Gibbs sampler, the Metropolis-Hastings algorithm, and others (Robert and Casella, 2004). The basic idea behind these algorithms is to compute an arbitrarily large sample from the posterior distribution $\theta \mid y$ without actually computing its normalizing constant $m(y)$. Given an arbitrarily large sample of the posterior, the posterior density $\pi(\theta \mid y)$ can be approximated quite accurately (say, using a histogram or a kernel-density smoother). In addition, any function of the posterior, such as the marginal mean, median, or standard deviation for an individual component of $\theta$, can be computed quite easily without actually evaluating the integrals implied in the calculation.
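A minimal sketch of this idea, assuming the binomial example with n = 50, v = 10 and a Be(1, 1) prior, is the random-walk Metropolis sampler below (a special case of Metropolis-Hastings with a symmetric proposal). It uses only the unnormalized posterior, so the normalizing constant is never computed. The starting value, proposal standard deviation, chain length, burn-in, and seed are arbitrary illustrative settings rather than recommendations.

```python
import math
import random


def log_unnormalized_posterior(theta, n=50, v=10):
    """log f(y|theta) + log pi(theta) up to a constant, for a Be(1, 1) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return v * math.log(theta) + (n - v) * math.log(1.0 - theta)


def metropolis(log_target, start=0.5, step_sd=0.1, draws=20000,
               burn_in=2000, seed=1):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for i in range(draws + burn_in):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


samples = metropolis(log_unnormalized_posterior)
posterior_mean = sum(samples) / len(samples)
print(f"MCMC estimate of posterior mean: {posterior_mean:.3f}")
print(f"Exact Be(11, 41) posterior mean: {11 / 52:.3f}")
```

Because the exact posterior is available in closed form for this model, the simulated mean can be checked against the known value of 11/52. In more complex models no such check exists, which is why convergence diagnostics matter in practice.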

One of the most widely used algorithms for sampling posterior distributions is called the Gibbs sampler. This algorithm is a special case of the Metropolis-Hastings algorithm, and the two are often used together to produce efficient hybrid algorithms that are relatively easy to implement.

The basic idea behind Gibbs sampling (and other MCMC algorithms) is to produce a random sample from the joint posterior distribution of two or more parameters by drawing random samples from a sequence of full conditional posterior distributions. The motivation for this idea is that while it may be difficult or impossible to draw a sample directly from the joint posterior, drawing a sample from the full conditional distribution of each parameter is often a relatively simple calculation.
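As a toy sketch of this idea, the code below runs a two-parameter Gibbs sampler. For the joint distribution it assumes a standard bivariate normal with correlation 0.8, a stand-in target chosen only because its full conditionals are known exactly: each parameter given the other is normal with mean rho times the other parameter and variance 1 - rho^2. The target, the parameter names, and the run lengths are all illustrative assumptions, not a model from the text.

```python
import math
import random


def gibbs_bivariate_normal(rho=0.8, draws=20000, burn_in=1000, seed=1):
    """Gibbs sampler that alternates draws from the two full conditionals."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)
    theta1, theta2 = 0.0, 0.0  # arbitrary starting values
    samples = []
    for i in range(draws + burn_in):
        # Draw theta1 from its full conditional given the current theta2.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        # Draw theta2 from its full conditional given the updated theta1.
        theta2 = rng.gauss(rho * theta1, cond_sd)
        if i >= burn_in:
            samples.append((theta1, theta2))
    return samples


def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    m = len(pairs)
    mean_x = sum(x for x, _ in pairs) / m
    mean_y = sum(y for _, y in pairs) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal()
mean_theta1 = sum(t1 for t1, _ in draws) / len(draws)
print(f"mean of theta1: {mean_theta1:.3f}")
print(f"sample correlation: {correlation(draws):.3f}")
```

Each iteration only ever samples from a one-dimensional normal distribution, yet the retained pairs behave like draws from the joint distribution, with marginal means near zero and correlation near the assumed value.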

Let's illustrate the Gibbs sampler for a model that includes 3 parameters: $\theta_1$, $\theta_2$, and $\theta_3$. Given a set of data $y$, we wish to compute an arbitrarily large sample from the joint posterior $[\theta_1, \theta_2, \theta_3 \mid y]$. We assume that the full-conditional distributions
