## Hypothesis Testing

The development of formal rules for testing scientific hypotheses has a long history in classical statistics. The application of these rules is especially useful in making evidentiary conclusions (inferences) from designed experiments, where the scientist exercises some degree of control over the relevant sources of uncertainty in the observed outcomes. In a well-designed experiment, hypothesis testing can be used to establish causal relationships between experimental outcomes and the systematic sources of variation in those outcomes that are manipulated as part of the design.

Hypothesis testing can also be applied in observational studies (surveys), where the scientist does not have direct control over the sources of uncertainty being tested. In such studies hypothesis testing may be used to assess whether estimated levels of association (or correlation) between two or more observable outcomes are 'statistically significant' (i.e., are unlikely to have occurred by chance given a particular significance level $\alpha$ of the test). However, in observational studies hypothesis testing cannot be used to determine whether a significant association between outcomes is the result of a coincidence of events or of an underlying causal relationship. Therefore, it can be argued that hypothesis testing is more useful scientifically in the analysis of designed experiments.

We find hypothesis testing to be useful because it provides a general context for the description of some important inference problems, including model selection, construction of confidence intervals, and assessment of model adequacy. We describe and illustrate these topics in the following subsections.

### 2.5.1 Model Selection

In our model-based approach to statistical inference, the classical problem of testing a scientific hypothesis is equivalent to the problem of selecting between two nested models of the data. By nested, we mean that the parameters of one model are a restricted set of the parameters of the other model. To illustrate, suppose we are interested in selecting between two linear regression models, one which simply contains an intercept parameter $\alpha$ and another which contains both intercept and slope parameters, $\alpha$ and $\beta$ respectively. By defining the parameter vector $\theta = (\alpha, \beta)$, we can specify the restricted model as $\theta = (\alpha, 0)$ and the full model as $\theta = (\alpha, \beta)$. Therefore, a decision to select the full model over the restricted model is equivalent to rejecting the null hypothesis that $\beta = 0$ in favor of the alternative hypothesis that $\beta \ne 0$, given that $\alpha$ is present in both models.

> lambda=50

--------------arguments for R2WinBUGS--------------

> data = list(n=length(y), y=y, lambda=lambda, a=a, b=b)

> sink('MarginalDensity.txt')

--------------call bugs() to fit model-----------------

> library(R2WinBUGS)

> fit = bugs(data, inits, params, model.file='MarginalDensity.txt', debug=F, n.chains=1, n.iter=100000, n.burnin=50000, n.thin=5)

> summary(phi)

Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1198 0.2033 0.2227 0.2242 0.2435 0.3541

Panel 2.8. R and WinBUGS code for sampling the posterior distribution of survival probability $\phi$.

The connection between model selection and hypothesis testing can be specified quite generally. Let $H_0$ and $H_1$ denote two complementary hypotheses which represent the null and alternative hypotheses of a testing problem. In addition, assume that both of these hypotheses can be specified in terms of a (possibly vector-valued) parameter $\theta$, which lies in the parameter space $\Theta$. Given these definitions, the null and alternative hypotheses can be represented as follows:

$$H_0 : \theta \in \Theta_0 \qquad H_1 : \theta \in \Theta_0^c,$$

where $\Theta_0$ is a subset of the parameter space $\Theta$ and $\Theta_0^c$ is the complement of $\Theta_0$. The model selection problem is basically equivalent to a two-sided hypothesis test wherein

$$H_0 : \theta = \theta_0 \qquad H_1 : \theta \ne \theta_0.$$

Thus, if we reject $H_0$, we accept the more complex model for which $\theta \ne \theta_0$; otherwise, we accept the simpler model for which $\theta = \theta_0$.

In classical statistics the decision to accept or reject H0 is based on the asymptotic distribution of a test statistic. Although a variety of test statistics have been developed, we describe only two, the Wald statistic and the likelihood-ratio statistic, because they are commonly used in likelihood-based inference.

### 2.5.1.1 Wald test

The Wald test statistic is derived from the asymptotic normality of MLEs. Recall from Eq. (2.3.8) that the distribution of the discrepancy $\hat\theta - \theta$ is asymptotically normal:

$$(\hat\theta - \theta) \mid \theta \sim \mathrm{N}\big(0, [I(\hat\theta)]^{-1}\big).$$

Suppose we have a single-parameter model and want to test $H_0 : \theta = \theta_0$. Under the assumptions of the null model, the asymptotic normality of $\hat\theta$ implies

$$(\hat\theta - \theta_0) \mid \theta_0 \sim \mathrm{N}\big(0, [I(\hat\theta)]^{-1}\big)$$

or, equivalently,

$$\frac{\hat\theta - \theta_0}{\mathrm{SE}(\hat\theta)} \sim \mathrm{N}(0, 1), \qquad (2.5.1)$$

where $\mathrm{SE}(\hat\theta) = [I(\hat\theta)]^{-1/2}$ is the asymptotic standard error of $\hat\theta$ computed by fitting the alternative model $H_1$. The left-hand side of Eq. (2.5.1) is called the Wald test statistic and is often denoted by $z$. Based on Eq. (2.5.1), we reject $H_0$ when $|z| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of a standard normal distribution.

The distribution of the square of a standard normal random variable is a chi-squared distribution with 1 degree of freedom, which we denote by $\chi^2(1)$; therefore, an equivalent test of $H_0$ is to compare $z^2$ to the $(1 - \alpha)$ quantile of a $\chi^2(1)$ distribution. We mention this only to provide a conceptual link to tests of hypotheses that involve multiple parameters. For example, if several parameters are held fixed under the assumptions of $H_0$, the multi-parameter version of the Wald test statistic and its asymptotic distribution are

$$(\hat\theta - \theta_0)'\, I(\hat\theta)\, (\hat\theta - \theta_0) \sim \chi^2(\nu),$$

where $\nu$ denotes the rank of $I(\hat\theta)$ (i.e., the number of parameters to be estimated under the alternative model $H_1$).
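The single-parameter Wald test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wald_test` and the numerical inputs are hypothetical, and the estimate and its standard error are assumed to have been obtained already by fitting the alternative model.

```python
from statistics import NormalDist


def wald_test(estimate, null_value, se, alpha=0.05):
    """Two-sided Wald test of H0: theta = null_value.

    Returns the statistic z, the critical value z_{1-alpha/2},
    and whether H0 is rejected at level alpha.
    """
    z = (estimate - null_value) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


# Hypothetical estimate and standard error, chosen only for illustration.
z, z_crit, reject = wald_test(estimate=0.8, null_value=0.0, se=0.5)

# Comparing z^2 with the squared critical value is the chi-squared(1)
# form of the same test, so the two decisions always agree.
same_decision = (z ** 2 > z_crit ** 2) == reject
```

With these illustrative inputs the statistic falls inside the acceptance region, so $H_0$ is not rejected.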

### 2.5.1.2 Likelihood ratio test

The likelihood ratio test is rooted in the notion that the likelihood function $L(\theta \mid y)$ provides a measure of relative support for different values of the parameter $\theta$. Therefore, in the model selection problem the ratio

$$\Lambda = \frac{L(\hat\theta_0 \mid y)}{L(\hat\theta \mid y)}$$

provides the ratio of likelihoods obtained by computing the MLE of $\theta_0$ (the parameters of the model associated with $H_0$) and the MLE of $\theta$ (the parameters of the model associated with $H_1$). Because $\theta_0$ is a restricted version of $\theta$, $L(\hat\theta \mid y) \ge L(\hat\theta_0 \mid y)$ (by definition), so the likelihood ratio must be a fraction (i.e., $0 < \Lambda \le 1$). Thus, lower values of $\Lambda$ lend greater support to $H_1$.

Under the assumptions of the null model $H_0$, the asymptotic distribution of the statistic $-2 \log \Lambda$ is chi-squared with $\nu$ degrees of freedom,

$$-2 \log \Lambda = -2\{\log L(\hat\theta_0 \mid y) - \log L(\hat\theta \mid y)\} \sim \chi^2(\nu), \qquad (2.5.3)$$

where $\nu$ equals the difference in the number of free parameters to be estimated under $H_0$ and $H_1$. The left-hand side of Eq. (2.5.3), which is called the likelihood ratio statistic, is nonnegative. In practice, we reject $H_0$ for values of $-2 \log \Lambda$ that exceed $\chi^2_{1-\alpha}(\nu)$, the $(1 - \alpha)$ quantile of a chi-squared distribution with $\nu$ degrees of freedom.

The likelihood ratio test can be used to evaluate the goodness of fit of a model of counts provided the sample is sufficiently large. In this context $H_1$ corresponds to a 'saturated' model in which the number of parameters equals the sample size $n$. We cannot learn anything new from a saturated model because its parameters essentially amount to a one-to-one transformation of the counts $y$; however, a likelihood ratio comparison between the saturated model and an approximating model $H_0$ can be used to assess the goodness of fit of $H_0$. For example, suppose the approximating model contains $k$ free parameters to be estimated; then, the value of the likelihood ratio statistic $-2 \log \Lambda$ can be compared to $\chi^2_{1-\alpha}(n - k)$ to determine whether $H_0$ is accepted at the $\alpha$ significance level. If $H_0$ is accepted, we may conclude that the approximating model provides a satisfactory fit to the data. In the context of this test the likelihood ratio statistic provides a measure of discrepancy between the counts in $y$ and the approximating model's estimate of $y$; consequently, $-2 \log \Lambda$ is often called the deviance test statistic, or simply the deviance, in this setting. We will see many applications of the deviance test statistic in later chapters.
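The likelihood ratio test of this subsection can be sketched as follows, assuming the two maximized log-likelihoods are already available. The function names and the numerical inputs are hypothetical. To stay within the Python standard library, the chi-squared distribution function is computed from a series for the regularized incomplete gamma function; rejecting when the p-value is below $\alpha$ is the same decision as rejecting when the statistic exceeds the $(1 - \alpha)$ quantile.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the series for the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= t / (a + k)
        total += term
        k += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


def likelihood_ratio_test(loglik_null, loglik_alt, df, alpha=0.05):
    """Return (-2 log Lambda, p-value, reject H0?) for nested models."""
    stat = -2.0 * (loglik_null - loglik_alt)
    p_value = 1.0 - chi2_cdf(stat, df)
    return stat, p_value, p_value < alpha


# Hypothetical maximized log-likelihoods; the models differ by 2 parameters.
stat, p_value, reject = likelihood_ratio_test(-52.3, -50.1, df=2)
```

In this made-up case the improvement in log-likelihood is too small to reject the simpler model at the 0.05 level.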

### 2.5.1.3 Example: Mortality of moths exposed to cypermethrin

We illustrate model selection and hypothesis testing using data observed in a dose-response experiment involving adults of the tobacco budworm (Heliothis virescens), a moth species whose larvae are responsible for damage to cotton crops in the United States and Central and South America (Collett, 1991, Example 3.7). In the experiment, batches of 20 moths of each sex were exposed to a pesticide called cypermethrin for a period of 72 hours, beginning two days after the adults had emerged from pupation. Both sexes were exposed to the same range of pesticide doses: 1, 2, 4, 8, 16, and 32 μg cypermethrin. At the end of the experiment the number of moths in each batch that were either knocked down (movement of moth was uncoordinated) or dead (moth was unable to move and was unresponsive to a poke from a blunt instrument) was recorded.

The experiment was designed to test whether males and females suffered the same mortality when exposed to identical doses of cypermethrin. The results are shown in Figure 2.6 where the empirical logit of the proportion of moths that died in each batch is plotted against $\log_2(\text{dose})$, which linearizes the exponential range of doses. The empirical logit, which is defined as

$$\log\left(\frac{y_i + 1/2}{N - y_i + 1/2}\right)$$

(wherein $y_i$ ($i = 1, \dots, 12$) denotes the number of deaths observed in the $i$th batch of $N = 20$ moths per batch), is the least biased estimator of the true logit of the proportion of deaths per batch (Agresti, 2002). We use the empirical logit because it allows outcomes of 100 percent mortality ($y = N$) or no mortality ($y = 0$) to be plotted on the logit scale along with the other outcomes of the experiment.
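A small sketch shows why the empirical logit is convenient for plotting. It assumes the usual form of the estimator, in which 1/2 is added to both the number of deaths and the number of survivors; the function name is hypothetical.

```python
import math


def empirical_logit(y, n):
    """Empirical logit of y deaths among n moths (1/2 added to each count)."""
    return math.log((y + 0.5) / (n - y + 0.5))


# Batches with no mortality, 50 percent mortality, and complete mortality.
values = [empirical_logit(y, 20) for y in (0, 10, 20)]
# roughly [-3.71, 0.0, 3.71]; the ordinary logit would be infinite at 0 and 20
```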

The empirical logits of mortality appear to increase linearly with $\log_2(\text{dose})$ for both sexes (Figure 2.6); therefore, we consider logistic regression models as a reasonable set of candidates for finding an approximating model of the data. Let $x_i$ denote the $\log_2(\text{dose})$ of cypermethrin administered to the $i$th batch of moths that contained either males ($z_i = 1$) or females ($z_i = 0$). A logistic-regression model containing 3 parameters is

$$y_i \mid N, p_i \sim \mathrm{Bin}(N, p_i)$$

$$\mathrm{logit}(p_i) = \alpha + \beta x_i + \gamma z_i,$$

where $\alpha$ is the intercept, $\beta$ is the effect of cypermethrin and $\gamma$ is the effect of sex.

Let $\theta = (\alpha, \beta, \gamma)$ denote a vector of parameters. The relevant hypotheses to be examined in the experiment are

$$H_0 : \theta = (\alpha, \beta, 0) \qquad H_1 : \theta = (\alpha, \beta, \gamma),$$

where $\gamma \ne 0$ in $H_1$. In other words, the test of $H_0$ amounts to selecting between two models of the data: the null model, wherein only $\alpha$ and $\beta$ are estimated, and the alternative model, wherein all 3 parameters are estimated.

To conduct a Wald test of $H_0$, we need only fit the alternative model, which yields the MLE $\hat\theta = (-3.47, 1.06, 1.10)$. We obtain $\mathrm{SE}(\hat\gamma) = 0.356$ from the inverse of the observed information matrix, and we compute a Wald test statistic of $z = (\hat\gamma - 0)/\mathrm{SE}(\hat\gamma) = 3.093$. Because $|z| > 1.96$, we reject $H_0$ at the 0.05 significance level and select the alternative model of the data over the null model.

To conduct a likelihood ratio test of $H_0$, we must compute MLEs for the parameters of both null and alternative models. Using these estimates, we obtain $\log L(\hat\alpha, \hat\beta \mid y) = -23.547$ for the null model and $\log L(\hat\alpha, \hat\beta, \hat\gamma \mid y) = -18.434$ for the alternative model. Therefore, the likelihood ratio statistic is $-2 \log \Lambda = -2\{\log L(\hat\alpha, \hat\beta \mid y) - \log L(\hat\alpha, \hat\beta, \hat\gamma \mid y)\} = 10.227$. The numbers of parameters estimated under the null and alternative models differ by $\nu = 3 - 2 = 1$; therefore, to test $H_0$ we compare the value of the likelihood ratio statistic to $\chi^2_{0.95}(1) = 3.84$. Since $10.227 > 3.84$, we reject $H_0$ at the 0.05 significance level and select the alternative model of the data over the null model.
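The estimates and both test statistics quoted above can be checked with a short Newton-Raphson fit written in plain Python. This is a sketch, not the authors' code. Note one loud assumption: the batch counts are not listed in this section, so the `DEATHS` values below are taken from the commonly reproduced version of Collett's (1991) data and should be verified against the original source. All function names are hypothetical.

```python
import math

# ASSUMPTION: counts not given in this section; transcribed from the widely
# reproduced version of Collett's data (6 male batches, then 6 female batches).
N = 20
DEATHS = [1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16]
LOG2_DOSE = [0, 1, 2, 3, 4, 5] * 2
MALE = [1] * 6 + [0] * 6


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def score_and_information(theta, X):
    """Gradient of the log-likelihood and the information matrix."""
    k = len(theta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x, y in zip(X, DEATHS):
        eta = sum(t * v for t, v in zip(theta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        for a in range(k):
            score[a] += x[a] * (y - N * p)
            for b in range(k):
                info[a][b] += N * p * (1.0 - p) * x[a] * x[b]
    return score, info


def fit(X, iterations=25):
    """Newton-Raphson iterations started from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        score, info = score_and_information(theta, X)
        theta = [t + s for t, s in zip(theta, solve(info, score))]
    return theta


def log_lik(theta, X):
    """Binomial log-likelihood, including the binomial coefficients."""
    total = 0.0
    for x, y in zip(X, DEATHS):
        eta = sum(t * v for t, v in zip(theta, x))
        total += (math.log(math.comb(N, y)) + y * eta
                  - N * math.log1p(math.exp(eta)))
    return total


X_alt = [[1.0, d, m] for d, m in zip(LOG2_DOSE, MALE)]   # alpha, beta, gamma
X_null = [[1.0, d] for d in LOG2_DOSE]                   # alpha, beta only

theta_alt = fit(X_alt)
theta_null = fit(X_null)

# SE(gamma-hat): square root of the last diagonal element of the inverse
# information matrix evaluated at the MLE.
_, info = score_and_information(theta_alt, X_alt)
se_gamma = math.sqrt(solve(info, [0.0, 0.0, 1.0])[2])

wald_z = theta_alt[2] / se_gamma
lr_stat = -2.0 * (log_lik(theta_null, X_null) - log_lik(theta_alt, X_alt))
```

Because the logit is the canonical link of the binomial model, the observed and expected information matrices coincide, so the standard error computed here corresponds to the one described in the text.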

The null hypothesis is rejected regardless of whether we use the Wald test or the likelihood ratio test; therefore, we may conclude that the difference in mortality of male and female moths exposed to the same dose of cypermethrin in the experiment is statistically significant, a result which certainly appears to be supported by the data in Figure 2.6.

Figure 2.6. Mortality of male (circle) and female (triangle) moths exposed to various doses of cypermethrin. Lines indicate the fit of a logistic regression model with dose (log2 scale) and sex as predictors.


We may also use a likelihood ratio test to assess the goodness of fit of the model that we have selected as a basis for inference. In this test the parameter estimates of the alternative ('saturated') model $H_1$ correspond to the observed proportions of moths that died in the 12 experimental batches, i.e., $\hat\theta = (\hat p_1, \hat p_2, \dots, \hat p_{12}) = (y_1/N, y_2/N, \dots, y_{12}/N)$. For this model $\log L(\hat p_1, \dots, \hat p_{12} \mid y) = -15.055$; therefore the likelihood ratio statistic for testing goodness of fit is $-2\{\log L(\hat\alpha, \hat\beta, \hat\gamma \mid y) - \log L(\hat p_1, \dots, \hat p_{12} \mid y)\} = 6.757$. To test $H_0$, we compare this value to $\chi^2_{0.95}(9) = 16.92$. Since $6.757 < 16.92$, we accept $H_0$ and conclude that the model with parameters $(\alpha, \beta, \gamma)$ cannot be rejected for lack of fit at the 0.05 significance level.
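The deviance calculation above can be sketched directly, since the saturated model needs no iterative fitting. As a labeled assumption, the batch counts in `DEATHS` are not listed in this section and are taken from the commonly reproduced version of Collett's (1991) data; the fitted-model log-likelihood and the critical value are the ones quoted in the text.

```python
import math

# ASSUMPTION: counts not given in this section; transcribed from the widely
# reproduced version of Collett's data (6 male batches, then 6 female batches).
N = 20
DEATHS = [1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16]


def saturated_log_lik(deaths, n):
    """Binomial log-likelihood with each p_i set to its observed proportion."""
    total = 0.0
    for y in deaths:
        total += math.log(math.comb(n, y))
        if 0 < y < n:
            # At y = 0 or y = n the remaining terms are zero (0 * log 0 = 0).
            p = y / n
            total += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total


loglik_saturated = saturated_log_lik(DEATHS, N)
loglik_model = -18.434            # three-parameter model, quoted in the text
deviance = -2.0 * (loglik_model - loglik_saturated)

df = len(DEATHS) - 3              # 12 saturated parameters minus 3 fitted
critical_value = 16.92            # 0.95 quantile of chi-squared(9), from the text
lack_of_fit = deviance > critical_value
```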

### 2.5.2 Inverting Tests to Estimate Confidence Intervals

In many studies a formal test of the statistical significance of an effect (e.g., a treatment effect in a designed experiment) is less important scientifically than an estimate of the magnitude of the effect. This is particularly true in observational studies where the estimated level of association between two or more observable outcomes is of primary scientific interest. In these cases the main inference problem is to estimate a parameter and to provide a probabilistic description of the uncertainty in the estimate.

In classical statistics the solution to this problem involves the construction of a confidence interval for the parameter. However, a variety of procedures have been developed for constructing confidence intervals. For example, in Section 2.3.2 we described how the asymptotic normality of MLEs provides a $100(1 - \alpha)$ percent confidence interval of the form

$$\hat\theta \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\theta) \qquad (2.5.4)$$

for a scalar-valued parameter $\theta$ wherein $\mathrm{SE}(\hat\theta) = [I(\hat\theta)]^{-1/2}$ (cf. Eq. (2.3.9)). It turns out that this confidence interval may be constructed by inverting the Wald test described in Section 2.5.1. To see this, note that the null hypothesis $H_0 : \theta = \theta_0$ is accepted if

$$\left| \frac{\hat\theta - \theta_0}{\mathrm{SE}(\hat\theta)} \right| \le z_{1-\alpha/2}.$$

Simple algebra can be used to prove that the range of $\theta_0$ values that satisfy this inequality is bounded by the confidence limits given in Eq. (2.5.4); consequently, there is a direct correspondence between the acceptance region of the null hypothesis (that is, the values of $\theta_0$ for which $H_0 : \theta = \theta_0$ is accepted) and the $100(1 - \alpha)$ percent confidence interval for $\theta$.
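The correspondence between the Wald acceptance region and the Wald confidence interval can be illustrated numerically. The sketch below assumes a binomial proportion with a hypothetical sample of 1 success in 5 trials, for which the asymptotic standard error is the familiar square root of p(1 - p)/n; the function names are also hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(estimate, se, alpha=0.05):
    """Confidence limits: estimate -/+ z_{1-alpha/2} * SE."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z_crit * se, estimate + z_crit * se


def wald_accepts(null_value, estimate, se, alpha=0.05):
    """True when H0: theta = null_value is accepted by the Wald test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs((estimate - null_value) / se) <= z_crit


# Hypothetical small sample: y = 1 success in n = 5 trials.
n, y = 5, 1
p_hat = y / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = wald_interval(p_hat, se)

# Null values inside the interval are accepted; values outside are rejected.
inside_accepted = wald_accepts(0.5, p_hat, se)
outside_rejected = not wald_accepts(0.6, p_hat, se)
```

In this small sample the lower limit is negative, which is not a valid value for a probability.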

Confidence intervals also may be constructed by inverting the likelihood ratio test described in Section 2.5.1. To see this, note that the null hypothesis $H_0 : \theta = \theta_0$ is accepted if

$$-2\{\log L(\hat\theta_0 \mid y) - \log L(\hat\theta \mid y)\} \le \chi^2_{1-\alpha}(\nu). \qquad (2.5.5)$$

Therefore, the range of the fixed parameters in $\theta_0$ that satisfy this inequality provides a $100(1 - \alpha)$ percent confidence region for those parameters. Such confidence regions are often more difficult to calculate than those based on inverting the Wald test because the free parameters in $\theta_0$ must be estimated by maximizing $L(\theta_0 \mid y)$ for each value of the parameters in $\theta_0$ that are fixed. For this reason $L(\hat\theta_0 \mid y)$ is called the profile likelihood function of $\theta_0$. If a confidence interval is required for only a single parameter, a numerical root-finding procedure may be used to calculate the values of that parameter that satisfy Eq. (2.5.5).
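The root-finding approach mentioned above can be sketched for the simplest case, a single binomial proportion. Because this model has no free (nuisance) parameters, the profile likelihood is just the likelihood itself. The sample (1 success in 5 trials) and all function names are hypothetical; the critical value 3.84 is the 0.95 quantile of a chi-squared distribution with 1 degree of freedom.

```python
import math


def binom_log_lik(p, y, n):
    """Binomial log-likelihood without the binomial coefficient,
    which cancels in the likelihood ratio."""
    return y * math.log(p) + (n - y) * math.log(1.0 - p)


def lr_statistic(p0, y, n):
    """-2 log Lambda for H0: p = p0 against the unrestricted MLE y/n."""
    return -2.0 * (binom_log_lik(p0, y, n) - binom_log_lik(y / n, y, n))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lr_interval(y, n, critical_value=3.84):
    """Limits where the likelihood ratio statistic equals the critical
    value; requires 0 < y < n so the MLE lies inside (0, 1)."""
    p_hat = y / n
    eps = 1e-12

    def excess(p):
        return lr_statistic(p, y, n) - critical_value

    return bisect(excess, eps, p_hat), bisect(excess, p_hat, 1.0 - eps)


lower, upper = lr_interval(y=1, n=5)
```

Unlike a Wald interval, both limits found this way necessarily lie between 0 and 1, because the likelihood is only evaluated at valid probabilities.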

In sufficiently large samples, confidence intervals computed by inverting the Wald test or the likelihood ratio test will be nearly identical. An obvious advantage of intervals based on the Wald test is that they are easy to compute. However, in small samples, such intervals can produce undesirable results, as we observed in Table 2.3, where an interval for the probability of occurrence $\psi$ includes negative values. Intervals based on inverting the likelihood ratio test can be more difficult to calculate, but an advantage of these intervals is that they are invariant to transformation of a model's parameters. Therefore, regardless of whether we parameterize the probability of occurrence in terms of $\psi$ or $\mathrm{logit}(\psi)$, the confidence intervals for $\psi$ will be the same. Table 2.4 illustrates the small-sample benefits of computing intervals for $\psi$ by inverting the likelihood ratio test.

 n Wald test Likelihood ratio test