
It turns out that an analytical solution exists in closed form. The MLE is (μ̂, σ̂²) = (ȳ, Σᵢ(yᵢ − ȳ)²/n), where ȳ denotes the sample mean of the observed body masses. Note that σ̂² is strictly less than s², the usual (unbiased) estimator of σ² (which divides by n − 1 rather than n); however, the difference becomes negligible as the sample size n increases.
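This relationship is easy to check numerically. The sketch below (not from the text) uses the ten body masses analyzed later in this section and verifies that the ML variance estimate divides by n and is strictly smaller than the unbiased s²:

```r
# Sketch (not from the text): the ML estimate of sigma^2 divides by n
# and is strictly smaller than the unbiased estimator s^2.
y <- c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
n <- length(y)

mu.mle <- mean(y)                      # MLE of mu: the sample mean, 7.035
sigma2.mle <- sum((y - mu.mle)^2) / n  # MLE of sigma^2: divides by n
s2 <- var(y)                           # unbiased estimator: divides by n - 1

sigma2.mle / s2  # equals (n - 1)/n = 0.9, so sigma2.mle < s2
```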

Suppose we could not have found the MLE of (μ, σ²) in closed form. In this case we would need to compute a numerical approximation. We could try the brute-force approach used in the earlier example, but in this example we would need to evaluate the log-likelihood function over 2 dimensions. Furthermore, the parameter space is unbounded (ℝ × ℝ⁺), so we would need to restrict evaluations of the log-likelihood to the vicinity of the MLE, which we do not know!

It turns out that the brute-force approach is seldom feasible, particularly as the number of model parameters becomes large. Therefore, numerical methods of

```r
> v = 1   # number of occupied sites (Section 2.3.1.1)
> n = 5   # number of sites sampled
> neglogLike = function(psi) -dbinom(v, size=n, prob=psi, log=TRUE)
> fit = optim(par=0.5, fn=neglogLike, method='BFGS')
Warning messages:
1: NaNs produced in: dbinom(x, size, prob, log)
2: NaNs produced in: dbinom(x, size, prob, log)
```

Panel 2.2. R code for numerically maximizing the likelihood function to approximate ψ̂.

optimization are often used to compute an approximation of the MLE. To illustrate, suppose we observe the following body masses of 10 animals:

y = (8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)

which have sample mean ȳ = 7.035 and sample variance s² = 3.638. Panel 2.3 contains R code for approximating the MLE and yields the estimates μ̂ = 7.035 and σ̂² = 3.274, which are the correct answers.

### 2.3.2 Properties of MLEs

Maximum likelihood estimators have several desirable properties. In this section we describe these properties and illustrate their consequences in inference problems. In particular, we show how MLEs are used in the construction of confidence intervals.

### 2.3.2.1 Invariance to reparameterization

If θ̂ is the MLE of θ, then for any one-to-one function τ(θ), the MLE of τ(θ) is τ(θ̂). This invariance to reparameterization can be extremely helpful in computing

```r
> y = c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
> neglogLike = function(param) {
+   mu = param[1]
+   logsigma = param[2]   # estimate log(sigma) so the search is unconstrained
+   -sum(dnorm(y, mean=mu, sd=exp(logsigma), log=TRUE))
+ }
> fit = optim(par=c(0,0), fn=neglogLike, method='BFGS')
> fit$par
[1] 7.0350020 0.5930949
```

Panel 2.3. R code for numerically maximizing the likelihood function to approximate (μ̂, log σ̂).

MLEs by numerical approximation. For example, let γ = τ(θ) and suppose we can easily compute the value γ̂ that maximizes L₁(γ|y); then, by the property of invariance, we can deduce that θ̂ = τ⁻¹(γ̂) maximizes L₂(θ|y) without actually solving dL₂(θ|y)/dθ = 0, which may involve numerical difficulties.
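A minimal numerical instance of this property (a sketch, reusing the normal example above): once the MLE of σ² is in hand, invariance gives the MLE of σ with no further optimization.

```r
# Sketch: invariance applied to the one-to-one transform tau(sigma^2) = sqrt(sigma^2).
y <- c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
sigma2.mle <- sum((y - mean(y))^2) / length(y)  # MLE of sigma^2: about 3.274
sigma.mle <- sqrt(sigma2.mle)                   # MLE of sigma, by invariance
```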

To illustrate, let's consider the problem of estimating the probability of occurrence ψ that we described in Section 2.3.1.1. The logit function, a one-to-one transformation of ψ, is defined as follows:

logit(ψ) = log(ψ / (1 − ψ))

and provides a mapping from the domain of ψ ([0, 1]) to the entire real line. Let θ = logit(ψ) denote a reparameterization of ψ. We can maximize the likelihood of θ given v to obtain θ̂ and then calculate ψ̂ by inverting the transformation as follows:

ψ̂ = logit⁻¹(θ̂) = exp(θ̂) / (1 + exp(θ̂))

We will use this inversion often; therefore, in the remainder of this book we let expit(θ) denote the function logit⁻¹(θ) as a matter of notational convenience.
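Both transformations can be written as one-line R functions (a sketch of the definitions stated above):

```r
# Sketch: logit and its inverse, expit.
logit <- function(p) log(p / (1 - p))   # maps [0, 1] to the real line
expit <- function(x) 1 / (1 + exp(-x))  # maps the real line back to [0, 1]

logit(0.2)         # -1.386294
expit(logit(0.2))  # 0.2: expit undoes logit
```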

Panel 2.4 contains R code for computing θ̂ by numerical optimization and for computing ψ̂ by inverting the transformation. Notice that the definition of the R function neglogLike is identical to that used earlier (Panel 2.2), except that we have substituted theta for psi as the function's argument and expit(theta) for psi in the body of the function. Therefore, the extra coding required to compute

ψ̂ on the logit scale is minimal. Notice also in Panel 2.4 that in maximizing the likelihood function of θ, R did not produce the somewhat troubling warning messages that appeared in maximizing the likelihood function of ψ (cf. Panel 2.2). The default behavior of R's optimization functions, optim and nlm, is to perform an unconstrained minimization, wherein no constraints are placed on the magnitude of the argument of the function being minimized. In other words, if the function's argument is a vector of p components, its value is assumed to lie anywhere in ℝᵖ. In our example the admissible values of θ include the entire real line; in contrast, the admissible values of ψ are confined to a subset of the real line ([0, 1]).

The lesson learned from this example is that when using unconstrained optimization algorithms to maximize a likelihood function of p parameters, one should typically try to formulate the likelihood so that the parameters are defined in ℝᵖ. The invariance of MLEs always allows us to back-transform the parameter estimates if that is necessary in the context of the problem.

### 2.3.2.2 Consistency

Suppose the particular set of modeling assumptions summarized in the joint pdf f(y|θ) is true, i.e., the approximating model of the data y correctly describes the process that generated the data. Under these conditions, we can prove that θ̂, the MLE of θ, converges to θ as the sample size n increases, which we denote mathematically as follows:

θ̂ → θ as n → ∞

Although the assumptions of an approximating model are unlikely to hold exactly, it is reassuring to know that with enough data, the MLE is guaranteed to provide the 'correct' answer.
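Consistency is easy to see by simulation. In the sketch below (the true value ψ = 0.2 and the sample sizes are illustrative assumptions, not from the text), the MLE ψ̂ = v/n settles ever closer to the true value as n grows:

```r
# Sketch: simulate occupancy counts at increasing sample sizes and watch
# the MLE v/n approach the true probability of occurrence.
set.seed(1)
psi.true <- 0.2
for (n in c(10, 100, 10000, 1000000)) {
  v <- rbinom(1, size = n, prob = psi.true)  # simulated number of occupied sites
  cat("n =", n, "  psi.hat =", v / n, "\n")
}
```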

```r
> v = 1
> n = 5
> expit = function(x) 1/(1+exp(-x))
> neglogLike = function(theta) -dbinom(v, size=n, prob=expit(theta), log=TRUE)
> fit = optim(par=0, fn=neglogLike, method='BFGS')
```

Panel 2.4. R code for numerically maximizing the likelihood function to estimate θ̂ = logit(ψ̂).

### 2.3.2.3 Asymptotic normality

As in the previous section, suppose the particular set of modeling assumptions summarized in the joint pdf f(y|θ) is true. If, in addition, a set of 'regularity conditions' that have to do with technical details² is satisfied, we can prove the following limiting behavior of the MLE of θ:

In a hypothetical set of repeated samples with θ fixed and with n → ∞,

θ̂ ~ N(θ, [I(θ̂)]⁻¹),   (2.3.8)

where I(θ̂) = −(d² log L(θ|y)/dθ²)|θ=θ̂ is called the observed information.

² Such as identifiability of the model's parameters and differentiability of the likelihood function. See page 516 of Casella and Berger (2002) for a complete list of conditions.

If θ is a vector of p parameters, then I(θ̂) is a p × p matrix called the observed information matrix.

According to Eq. (2.3.8), the distribution of the discrepancy θ̂ − θ obtained under repeated sampling is approximately normal with mean zero as n → ∞. Therefore, θ̂ is an asymptotically unbiased estimator of θ. Similarly, Eq. (2.3.8) implies that the inverse of the observed information provides the estimated asymptotic variance (or asymptotic covariance matrix) of θ̂.

The practical utility of asymptotic normality is evident in the construction of 100(1 − α) percent confidence intervals for θ. For example, suppose θ is scalar-valued; then in repeated samples, the random interval

θ̂ ± z₁₋α/2 √([I(θ̂)]⁻¹)   (2.3.9)

'covers' the fixed value θ 100(1 − α) percent of the time, provided n is sufficiently large. Here, z₁₋α/2 denotes the (1 − α/2) quantile of a standard normal distribution. Note that Eq. (2.3.9) does not imply that any individual confidence interval includes θ with probability 1 − α. This misinterpretation of the role of probability is an all-too-common occurrence in applications of statistics. An individual confidence interval either includes θ or it doesn't. A correct probability statement (or inference) refers to the proportion of confidence intervals that include θ in a hypothetical, infinitely long sequence of repeated samples. In this sense 1 − α is the probability (relative frequency) that an interval constructed using Eq. (2.3.9) includes the fixed value θ.
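The repeated-sampling interpretation can be checked by simulation (a sketch; ψ = 0.2, n = 500, and 2000 replicates are illustrative assumptions, not from the text):

```r
# Sketch: in repeated samples, roughly 95 percent of the Wald intervals
# of Eq. (2.3.9) should cover the fixed, true value of psi.
set.seed(123)
psi.true <- 0.2; n <- 500; nreps <- 2000
zcrit <- qnorm(0.975)
covered <- logical(nreps)
for (i in 1:nreps) {
  v <- rbinom(1, size = n, prob = psi.true)
  psi.hat <- v / n
  se <- sqrt(psi.hat * (1 - psi.hat) / n)  # sqrt of [I(psi.hat)]^-1
  covered[i] <- (psi.hat - zcrit * se <= psi.true) &&
                (psi.true <= psi.hat + zcrit * se)
}
mean(covered)  # close to 0.95
```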

Example: estimating the probability of occurrence

As an illustration, let's compute a 95 percent confidence interval for ψ, the probability of occurrence that was defined earlier in an example (Section 2.3.1.1). The observed information is easily derived using calculus:

I(ψ̂) = n / (ψ̂(1 − ψ̂))

The model has only one parameter, ψ; therefore, we simply take the reciprocal of I(ψ̂) to compute its inverse. Substituting ψ̂ for ψ yields the 95 percent confidence interval for ψ:

ψ̂ ± 1.96 √(ψ̂(1 − ψ̂)/n)
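Plugging v = 1 and n = 5 into these formulas reproduces, to about four decimal places, the interval computed numerically in Panel 2.5 (a sketch; the tiny discrepancy arises because the panel uses a numerically approximated Hessian):

```r
# Sketch: analytic 95 percent Wald interval for psi with v = 1, n = 5.
v <- 1; n <- 5
psi.hat <- v / n                       # 0.2
info <- n / (psi.hat * (1 - psi.hat))  # observed information: 31.25
se <- sqrt(1 / info)                   # about 0.1789
zcrit <- qnorm(0.975)
c(psi.hat - zcrit * se, psi.hat + zcrit * se)  # about -0.1506 and 0.5506
```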

Suppose we had not been able to derive the observed information or to compute its inverse analytically. In this case we would need to compute a numerical approximation of [I(ψ̂)]⁻¹. Panel 2.5 contains the R code for computing the MLE of ψ and a 95 percent confidence interval having observed only v = 1 occupied site in a sample of n = 5 sites. As before, we estimate ψ̂ = 0.20. A numerical approximation of I(ψ̂) is computed by adding hessian=TRUE to the list of optim's arguments. After rounding, our R code yields the following 95 percent confidence interval for ψ: [−0.15, 0.55]. This is the correct answer, but it includes negative values of ψ, which don't really make sense because ψ is bounded on [0, 1] by definition.

One solution to this problem is to compute a confidence interval for θ = logit(ψ) and then transform the upper and lower confidence limits back to the ψ scale (see Panel 2.6). This approach produces an asymmetrical confidence interval for ψ ([0.027, 0.691]), but the interval is properly contained in [0, 1].
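The back-transformed interval can also be reproduced analytically (a sketch; a short calculation, not shown in the text, gives the observed information on the logit scale as I(θ̂) = nψ̂(1 − ψ̂), and the result matches Panel 2.6):

```r
# Sketch: 95 percent interval computed on the logit scale, then mapped
# back to the psi scale with expit.
expit <- function(x) 1 / (1 + exp(-x))
v <- 1; n <- 5
psi.hat <- v / n
theta.hat <- log(psi.hat / (1 - psi.hat))  # logit(0.2) = -1.386294
info <- n * psi.hat * (1 - psi.hat)        # observed information on logit scale: 0.8
se <- sqrt(1 / info)
zcrit <- qnorm(0.975)
expit(c(theta.hat - zcrit * se, theta.hat + zcrit * se))
# about 0.0272 and 0.6910: asymmetrical, but contained in [0, 1]
```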

Another solution to the problem of nonsensical confidence limits is to use a procedure that produces limits that are invariant to reparameterization. We will describe such procedures in the context of hypothesis testing (see Section 2.5). For now, we simply note that confidence limits computed using these procedures and those computed using Eq. (2.3.9) are asymptotically equivalent. The construction of intervals based on Eq. (2.3.9) is far more popular because the confidence limits are relatively easy to compute; in contrast, the calculation of confidence limits based on the alternative procedures is more challenging in many instances.

Before leaving our example of interval estimation for ψ, let's examine the influence of sample size. Suppose we had examined a sample of n = 50 randomly selected

```r
> v = 1
> n = 5
> neglogLike = function(psi) -dbinom(v, size=n, prob=psi, log=TRUE)
> fit = optim(par=0.5, fn=neglogLike, method='BFGS', hessian=TRUE)
Warning messages:
1: NaNs produced in: dbinom(x, size, prob, log)
2: NaNs produced in: dbinom(x, size, prob, log)
> psi.mle = fit$par
> psi.se = sqrt(1/fit$hessian)   # square root of inverse observed information
> zcrit = qnorm(0.975)
> c(psi.mle-zcrit*psi.se, psi.mle+zcrit*psi.se)
[1] -0.1506020  0.5506024
```

Panel 2.5. R code for computing a 95 percent confidence interval for ψ.

locations (a tenfold increase in sample size) and had observed the species to be present at v = 10 of these locations. Obviously, our estimate of ψ is unchanged because ψ̂ = 10/50 = 0.2. But how has the uncertainty in our estimate changed? Earlier we showed that I(ψ̂) = n/(ψ̂(1 − ψ̂)). Because ψ̂ is identical in both samples, we may conclude that the observed information in the sample of n = 50 locations is ten times higher than that in the sample of n = 5 locations, as shown in Table 2.3. In fact, this is easily illustrated by plotting the log-likelihood functions for each sample (Figure 2.2). Notice that the curvature of the log-likelihood function in the vicinity of the MLE is greater for the larger sample (n = 50). This is consistent with the differences in observed information because I(ψ̂) is the negative of the second derivative of log L(ψ|v) evaluated at ψ̂, which essentially measures the curvature of log L(ψ|v) at the MLE. The log-likelihood function decreases with distance from ψ̂ more rapidly in the sample of n = 50 locations than in the sample of n = 5 locations; therefore, we might expect the estimated precision of ψ̂ to be higher in the larger sample. This is exactly the case; the larger sample yields a narrower confidence interval for ψ.

Table 2.3 and Figure 2.2 also illustrate the effects of parameterizing the log-likelihood in terms of θ = logit(ψ). The increase in observed information associated with the larger sample is the same (a tenfold increase), and this results in a narrower confidence interval. The confidence limits for ψ based on the asymptotic normality of θ̂ are not identical to those based on the asymptotic normality of ψ̂, as mentioned earlier; however, the discrepancy between the two confidence intervals is much lower in the larger sample.
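The effect of the tenfold increase in sample size on interval width can be made concrete (a sketch, not from the text; because ψ̂ = 0.2 in both samples, the width shrinks by exactly √10):

```r
# Sketch: 95 percent Wald interval widths for v/n = 1/5 versus 10/50.
zcrit <- qnorm(0.975)
ci.width <- function(v, n) {
  psi.hat <- v / n
  2 * zcrit * sqrt(psi.hat * (1 - psi.hat) / n)  # width of the Eq. (2.3.9) interval
}
ci.width(1, 5)                     # about 0.70
ci.width(10, 50)                   # about 0.22
ci.width(1, 5) / ci.width(10, 50)  # sqrt(10) = 3.162...
```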

```r
> v = 1
> n = 5
> expit = function(x) 1/(1+exp(-x))
> neglogLike = function(theta) -dbinom(v, size=n, prob=expit(theta), log=TRUE)
> fit = optim(par=0, fn=neglogLike, method='BFGS', hessian=TRUE)
> theta.mle = fit$par
> theta.se = sqrt(1/fit$hessian)
> zcrit = qnorm(0.975)
> expit(c(theta.mle-zcrit*theta.se, theta.mle+zcrit*theta.se))
[1] 0.02718309 0.69104557
```

Panel 2.6. R code for computing a 95 percent confidence interval for ψ by back-transforming the lower and upper limits of the interval for θ = logit(ψ).

Figure 2.2. Comparison of log-likelihood functions for binomial outcomes based on different sample sizes, n = 5 (dashed line) and n = 50 (solid line), and different parameterizations, ψ (upper panel) and logit(ψ) (lower panel). Dashed vertical line indicates the MLE, which is identical in both samples.
