## Bayesian Inference

In this section we describe the Bayesian approach to model-based inference. To facilitate comparisons with classical inference procedures, we will apply the Bayesian approach to some of the same examples used in Section 2.3.

Let $y = (y_1, \ldots, y_n)$ denote a sample of $n$ observations, and suppose we develop an approximating model of $y$ that contains a (possibly vector-valued) parameter $\theta$. As in classical statistics, the approximating model is a formal expression of the processes that are assumed to have produced the observed data. However, in the Bayesian view the model parameter $\theta$ is treated as a random variable and the approximating model is elaborated to include a probability distribution for $\theta$ that specifies one's beliefs about the magnitude of $\theta$ prior to having observed the data. This elaboration of the model is therefore called the prior distribution.

In the Bayesian view, computing an inference about $\theta$ is fundamentally just a probability calculation that yields the probable magnitude of $\theta$ given the assumed prior distribution and given the evidence in the data. To accomplish this calculation, the observed data $y$ are assumed to be fixed (once the sample has been obtained), and all inferences about $\theta$ are made with respect to the fixed observations $y$. Unlike classical statistics, Bayesian inferences do not rely on the idea of hypothetical repeated samples or on the asymptotic properties of estimators of $\theta$. In fact, probability statements (i.e., inferences) about $\theta$ are exact for any sample size under the Bayesian paradigm.

    > y = c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
    > neglogLike = function(param) {
    + mu = param[1]
    > fit = optim(par=c(0,0), fn=neglogLike, method='BFGS', hessian=TRUE)
    > fit$hessian
    > covMat = chol2inv(chol(fit$hessian))
    > covMat
    [2,] 2.049843e-07 4.999987e-02
    > c(mu.mle-zcrit*mu.se, mu.mle+zcrit*mu.se)
    [1] 5.913433 8.156571

Panel 2.7. R code for computing a 95 percent confidence interval for ...

### 2.4.1 Bayes' Theorem and the Problem of 'Inverse Probability'

To describe the principles of Bayesian inference in more concrete terms, it's convenient to begin with some definitions. Let's assume, without loss of generality, that the observed data $y$ are modeled as continuous random variables and that $f(y \mid \theta)$ denotes the joint pdf of $y$ given a model indexed by the parameter $\theta$. In other words, $f(y \mid \theta)$ is an approximating model of the data. Let $\pi(\theta)$ denote the pdf of an assumed prior distribution of $\theta$. Note that $f(y \mid \theta)$ provides the probability of the data given $\theta$. However, once the data have been collected the value of $y$ is known; therefore, to compute an inference about $\theta$, we really need the probability of $\theta$ given the evidence in the data, which we denote by $\pi(\theta \mid y)$.

Historically, the question of how to compute $\pi(\theta \mid y)$ was called the 'problem of inverse probability.' In the 18th century Reverend Thomas Bayes (1763) provided a solution to this problem, showing that $\pi(\theta \mid y)$ can be calculated to update one's prior beliefs (as summarized in $\pi(\theta)$) using the laws of probability³:

$$\pi(\theta \mid y) = \frac{f(y \mid \theta)\,\pi(\theta)}{m(y)}, \qquad (2.4.1)$$

where $m(y) = \int f(y \mid \theta)\,\pi(\theta)\,d\theta$ denotes the marginal probability of $y$. Eq. (2.4.1) is known as Bayes' theorem (or Bayes' rule), and $\pi(\theta \mid y)$ is called the posterior distribution of $\theta$ to remind us that it summarizes one's beliefs about the magnitude of $\theta$ after having observed the data. Bayes' theorem provides a coherent, probability-based framework for inference because it specifies how prior beliefs about $\theta$ can be converted into posterior beliefs in light of the evidence in the data.
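As a rough numerical illustration of Bayes' theorem, the sketch below evaluates the posterior on a grid of parameter values. It reuses the ten observations listed in Panel 2.7 and assumes, purely for illustration, a normal model with a known standard deviation and a normal prior on the mean. The values of `sigma`, `prior_mu`, and `prior_sd`, and the grid limits, are hypothetical choices and do not come from the text.

```r
# Observations listed in Panel 2.7 of this section
y <- c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)

# ASSUMPTIONS (hypothetical, for illustration only)
sigma    <- 2    # sd of the normal model, treated as known
prior_mu <- 0    # mean of the normal prior on theta
prior_sd <- 10   # sd of the normal prior on theta

# Grid of candidate values of theta (limits chosen to cover the data)
theta  <- seq(0, 15, by = 0.005)
dtheta <- theta[2] - theta[1]

# Likelihood f(y | theta) and prior density pi(theta) at each grid point
lik   <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = sigma)))
prior <- dnorm(theta, mean = prior_mu, sd = prior_sd)

# Marginal probability m(y), approximated by a Riemann sum over the grid
m_y <- sum(lik * prior) * dtheta

# Bayes' theorem: posterior density pi(theta | y) on the grid
posterior <- lik * prior / m_y

# Posterior mean from the grid
post_mean_grid <- sum(theta * posterior) * dtheta

# Closed-form posterior mean for this normal-normal setup, for comparison
post_prec       <- 1 / prior_sd^2 + length(y) / sigma^2
post_mean_exact <- (prior_mu / prior_sd^2 + sum(y) / sigma^2) / post_prec
```

Comparing `post_mean_grid` with `post_mean_exact` gives a quick check that the grid is fine enough and wide enough. A coarser or narrower grid would make the approximation of $m(y)$ less reliable.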

Close ties obviously exist between Bayes' theorem and likelihood-based inference because $f(y \mid \theta)$ is also the basis of Fisher's likelihood function (Section 2.3.1). However, Fisher was vehemently opposed to the 'theory of inverse probability', as applications of Bayes' theorem were called in his day. Fisher sought inference procedures that did not rely on the specification of a prior distribution, and he deliberately used the term 'likelihood' for $f(y \mid \theta)$ instead of calling it a probability. Therefore, it is important to remember that although the likelihood function is present in both inference paradigms (i.e., classical and Bayesian), dramatic differences exist in the way that $f(y \mid \theta)$ is used and interpreted.

³ Based on the definition of conditional probability, we know $[\theta \mid y] = [y, \theta]/[y]$ and $[y \mid \theta] = [y, \theta]/[\theta]$. Rearranging the second equation yields the joint pdf, $[y, \theta] = [y \mid \theta][\theta]$, which when substituted into the first equation produces Bayes' rule: $[\theta \mid y] = ([y \mid \theta][\theta])/[y]$.

#### 2.4.1.1 Example: estimating the probability of occurrence

Let's reconsider the problem introduced in Section 2.3.1.1 of computing inferences about the probability of occurrence $\psi$. Our approximating model of $v$, the total number of sample locations where the species is present, is given by the binomial pmf $f(v \mid \psi)$ given in Eq. (2.3.4). A prior density $\pi(\psi)$ is required to compute inferences about $\psi$ from the posterior density

$$\pi(\psi \mid v) = \frac{f(v \mid \psi)\,\pi(\psi)}{m(v)},$$

where $m(v) = \int_0^1 f(v \mid \psi)\,\pi(\psi)\,d\psi$. It turns out that $\pi(\psi \mid v)$ can be expressed in closed form if the prior $\pi(\psi) = \mathrm{Be}(\psi \mid a, b)$ is assumed, where the values of $a$ and $b$ are fixed (by assumption). To be specific, this choice of prior implies that the posterior distribution of $\psi$ is $\mathrm{Be}(a + v,\, b + n - v)$. Thus, the prior and posterior distributions belong to the same class of distributions (in this case, the class of beta distributions). This equivalence, known as conjugacy, identifies the beta distribution as the conjugate prior for the success parameter of a binomial distribution. We will encounter other examples of conjugacy throughout this book. For now, let's continue with the example.
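The conjugacy result can be checked numerically. The sketch below computes the posterior of the probability of occurrence directly from Bayes' theorem, using `integrate` for the marginal probability, and compares it with the closed-form beta density. The values of `n`, `v`, `a`, and `b` are hypothetical and were chosen only for illustration.

```r
# ASSUMPTIONS (hypothetical values, for illustration only)
n <- 10   # number of sample locations
v <- 3    # number of locations where the species is present
a <- 2    # first parameter of the beta prior
b <- 2    # second parameter of the beta prior

# Binomial likelihood f(v | psi) and beta prior pi(psi)
lik   <- function(psi) dbinom(v, size = n, prob = psi)
prior <- function(psi) dbeta(psi, shape1 = a, shape2 = b)

# Marginal probability m(v): integral of f(v | psi) * pi(psi) over (0, 1)
m_v <- integrate(function(psi) lik(psi) * prior(psi),
                 lower = 0, upper = 1)$value

# Posterior computed directly from Bayes' theorem
posterior_numeric <- function(psi) lik(psi) * prior(psi) / m_v

# Posterior implied by conjugacy: Be(a + v, b + n - v)
posterior_conjugate <- function(psi) dbeta(psi, a + v, b + n - v)

# Largest absolute difference between the two over a grid of psi values
psi_grid     <- seq(0.05, 0.95, by = 0.05)
max_abs_diff <- max(abs(posterior_numeric(psi_grid) -
                        posterior_conjugate(psi_grid)))
```

If the two densities agree, `max_abs_diff` should be negligibly small, which is what conjugacy predicts for any valid choice of `n`, `v`, `a`, and `b`.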

Suppose we assume prior indifference about the magnitude of $\psi$. In other words, before observing the data, we assume that all values of $\psi$ are equally probable. This assumption is specified with a $\mathrm{Be}(1, 1)$ prior (equivalently, a $\mathrm{U}(0, 1)$ prior) and implies that the posterior distribution of $\psi$ is $\mathrm{Be}(1 + v,\, 1 + n - v)$. It's worth noting that the mode of this distribution equals $v/n$, which is equivalent to the MLE of $\psi$ obtained in a classical, likelihood-based analysis. Now suppose a sample of $n = 5$ locations contains only $v = 1$ occupied site; then the $\mathrm{Be}(2, 5)$ distribution, illustrated in Figure 2.3, may be used to compute inferences for $\psi$. For example, the posterior mean and mode of $\psi$ are 0.29 and 0.20, respectively. Furthermore, we can compute the $\alpha/2$ and $1 - \alpha/2$ quantiles of the $\mathrm{Be}(2, 5)$ posterior and use these to obtain a $100(1 - \alpha)$ percent credible interval for $\psi$. (Bayesians use the term 'credible interval' to distinguish it from the frequentist concept of a confidence interval.) For example, the 95 percent credible interval for $\psi$ is [0.04, 0.64].
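The posterior summaries quoted above can be reproduced with a few lines of R. This is a minimal sketch that assumes the Be(1, 1) prior and the sample of 5 locations with 1 occupied site described in the text. The variable names are illustrative.

```r
# Data and prior from the example: n = 5 locations, v = 1 occupied, Be(1, 1) prior
n <- 5
v <- 1

# Parameters of the beta posterior, Be(1 + v, 1 + n - v) = Be(2, 5)
a_post <- 1 + v
b_post <- 1 + n - v

# Posterior mean and mode of a beta distribution
post_mean <- a_post / (a_post + b_post)             # 2/7, about 0.29
post_mode <- (a_post - 1) / (a_post + b_post - 2)   # 1/5 = 0.20

# Equal-tailed 95 percent credible interval from the posterior quantiles
alpha    <- 0.05
cred_int <- qbeta(c(alpha / 2, 1 - alpha / 2), a_post, b_post)

round(c(mean = post_mean, mode = post_mode), 2)
round(cred_int, 2)   # approximately 0.04 and 0.64
```

The interval here is the equal-tailed one described in the text. Other constructions, such as a highest posterior density interval, would generally give different endpoints for a skewed posterior like this one.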

We use this example to emphasize that a Bayesian credible interval and a frequentist confidence interval have completely different interpretations. The Bayesian