In null hypothesis testing, the p-value is the probability of obtaining the observed data, or more extreme data, if the null hypothesis is true. For example, consider the null hypothesis that the exponent of the scaling relationship between metabolic rate and body size is 0.75. Then, we collect some data on metabolic rate and body size and estimate the value of the exponent as 0.77, leading to a difference between the null hypothesis and the estimate of 0.02. However, given the variation expected in the data, a difference this large might arise just by chance. The p-value is the probability of getting a difference this big or bigger if the null hypothesis is true.
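This calculation can be sketched with a normal approximation: compare the difference between the estimate and the null value to the standard error of the estimate. The standard error used here (0.015) is a hypothetical value for illustration, not one taken from the example above.

```python
import math

# Two-sided p-value for the scaling-exponent example, using a normal
# approximation. The standard error is an assumed, illustrative value.
null_exponent = 0.75
estimate = 0.77
se = 0.015  # hypothetical standard error of the estimated exponent

z = abs(estimate - null_exponent) / se
# P(|Z| >= z) under the null; erfc(z / sqrt(2)) equals 2 * (1 - Phi(z))
p_value = math.erfc(z / math.sqrt(2))
print(round(p_value, 3))
```

With these assumed numbers the difference of 0.02 is well within what chance variation could produce, so the null hypothesis would not be rejected.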
Critics of null hypothesis testing ask: 'Why should data that have never been observed (e.g. the occurrence of an exponent greater than 0.77) influence our inference about the validity of the null hypothesis?' This seems to be a reasonable concern. It is easy to construct examples in which the observed data are impossible if the null hypothesis is true, but where the p-value is not zero because more extreme data are possible (e.g. a null hypothesis that the number of breeding birds in a monogamous species is odd; any observed count of paired birds will be even, and hence impossible under the null, yet more extreme counts still contribute to the p-value).
In practice, most null hypotheses predict unimodal distributions for the data, with the most common form being a normal distribution or a similar distribution derived from the normal (e.g. the t or chi-squared distribution). As a result, there is usually a monotonic relationship between the probability of obtaining the observed data and the p-value: as the probability of observing the data increases, so too does the p-value. Therefore, the influence of the 'unobserved results' is usually small. For the example in Box 2.1, the different stopping rules led to different interpretations of what constituted more extreme data. The resulting difference in the p-value was relatively small (0.033 versus 0.073), although large enough that the result of the hypothesis test was affected in this case.
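Box 2.1 is not reproduced here, but its two p-values match the classic stopping-rule example, so the following is a plausible reconstruction under that assumption: nine successes and three failures are observed, the null hypothesis is a success probability of 0.5, and the p-value depends on whether the total number of trials was fixed in advance (binomial sampling) or sampling continued until the third failure (negative binomial sampling).

```python
from math import comb

theta = 0.5            # success probability under the null hypothesis
successes, failures = 9, 3
n = successes + failures

# Binomial stopping rule: the number of trials (12) was fixed in
# advance, so 'more extreme' means 9 or more successes out of 12.
p_binomial = sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
                 for k in range(successes, n + 1))

# Negative binomial stopping rule: sampling continued until the 3rd
# failure, so 'more extreme' means 9 or more successes before that
# failure; take the complement of the first 9 negative binomial terms.
p_neg_binomial = 1 - sum(comb(y + failures - 1, failures - 1)
                         * theta**y * (1 - theta)**failures
                         for y in range(successes))

print(round(p_binomial, 3), round(p_neg_binomial, 3))  # 0.073 0.033
```

The data are identical in both cases; only the stopping rule, and hence the set of 'more extreme' unobserved outcomes, differs, yet the p-value moves across the conventional 0.05 threshold.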