## P3.3.3 Hypergeometric Distribution

In section P3.3.1, we mentioned that the binomial distribution arises when sampling n individuals from a population that has a proportion p of individuals of a certain type. Technically, this claim is true only if each individual sampled is replaced before the next individual is sampled; otherwise, p changes as the sample is gathered, so the outcome of each trial depends on the outcomes of previous trials. If sampling occurs without replacement, the hypergeometric distribution describes the distribution of possible samples.

Definition P3.6:

The hypergeometric distribution describes the probability of observing $k$ "ones" in a sample of size $n$, which is randomly drawn without replacement from a population of size $N$:

$$P(X = k) = \frac{\binom{N_1}{k}\binom{N_2}{n-k}}{\binom{N}{n}},$$

where $N_1 = N p$ is the number of "ones" and $N_2 = N (1 - p)$ is the number of "zeros" in the total population before sampling.
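As a minimal sketch, the definition can be evaluated directly with Python's `math.comb`: the number of ways to draw k ones and n - k zeros, divided by the number of ways to draw n individuals from N. The function name `hypergeom_pmf`, its argument names, and the population values below are illustrative choices, not notation from the text.

```python
from math import comb

def hypergeom_pmf(k, n, N, n_ones):
    """Probability of k ones in a sample of size n drawn without replacement
    from a population of N individuals, n_ones of which are ones."""
    n_zeros = N - n_ones
    # Outcomes that cannot occur have probability zero
    if k < 0 or k > n_ones or n - k < 0 or n - k > n_zeros:
        return 0.0
    return comb(n_ones, k) * comb(n_zeros, n - k) / comb(N, n)

# Illustrative values (assumed): 4 ones in a population of 10, sample of 3
N, n_ones, n = 10, 4, 3
probs = [hypergeom_pmf(k, n, N, n_ones) for k in range(n + 1)]
print(probs)
print(sum(probs))  # the probabilities over all k should sum to one
```

Working with integer binomial coefficients and dividing only once at the end avoids accumulating rounding error for moderate population sizes.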

The probability distribution for a hypergeometric distribution looks complicated, but it can be derived by counting up all of the types of samples that could occur (Box P3.1). The denominator, $\binom{N}{n}$, represents the number of different ways (i.e., the number of combinations; Box P3.1) in which $n$ individuals can be chosen without replacement from a population of size $N$, regardless of whether they are successes or failures. For example, there are three ways to choose two individuals ($n = 2$) from a population of size three ($N = 3$): either the first, the second, or the third individual can be left out. Out of all of these possibilities, we then need to count up all of those instances in which there were exactly $k$ successes and $n - k$ failures. Moving to the numerator, the quantity $\binom{N_1}{k}$ is the number of ways (i.e., the number of combinations) in which $k$ successes can be drawn (without replacement) from the subpopulation of $N_1 = Np$ successes without caring about the order in which they occur. For each of these, there are then $\binom{N_2}{n-k}$ different ways (i.e., combinations) in which the desired $n - k$ failures can be drawn (without replacement) from the subpopulation of $N_2 = N(1 - p)$ failures. Thus the total number of ways in which we can obtain exactly $k$ successes and $n - k$ failures is $\binom{N_1}{k}\binom{N_2}{n-k}$. Consequently, of the $\binom{N}{n}$ ways that we could sample $n$ individuals from a population, only $\binom{N_1}{k}\binom{N_2}{n-k}$ of these will contain $k$ successes and $n - k$ failures. The fraction of samples with $k$ successes is thus given by Definition P3.6.
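The counting argument can be checked by brute force for a small population. The sketch below enumerates every unordered sample with `itertools.combinations` and compares the counts with the binomial coefficients; the helper name `count_samples`, the labeling of individuals, and the population sizes are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def count_samples(N, n_success, n, k):
    """Enumerate all unordered samples of size n from N individuals and
    count those containing exactly k successes."""
    total = 0
    with_k = 0
    # Individuals are labeled 0..N-1; labels below n_success are successes
    for sample in combinations(range(N), n):
        total += 1
        if sum(1 for i in sample if i < n_success) == k:
            with_k += 1
    return total, with_k

# The example from the text: choosing two individuals from three
total, _ = count_samples(N=3, n_success=1, n=2, k=0)
print(total)  # 3

# A larger illustrative case: brute-force counts versus the formulas
N, n_success, n, k = 8, 3, 4, 2
total, with_k = count_samples(N, n_success, n, k)
print(total == comb(N, n))
print(with_k == comb(n_success, k) * comb(N - n_success, n - k))
```

Enumeration is only practical for small populations because the number of samples grows very quickly with N and n, which is why the closed-form count matters.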

Using Definitions P3.2 and P3.3, the mean and variance of a hypergeometric random variable are

$$E[X] = n p, \qquad \mathrm{Var}[X] = n p (1 - p)\,\frac{N - n}{N - 1}.$$

The mean is the same as that of a binomial random variable (P3.3). However, the variance is smaller than the variance of a binomial random variable by a factor $(N - n)/(N - 1)$. The variance decreases toward zero as the sample size approaches the population size ($n \to N$), because the composition of the sample becomes nearly the same as the composition of the whole population. Conversely, if the sample size is very small relative to the population size ($n \ll N$), then $(N - n)/(N - 1)$ approaches one, and the hypergeometric distribution converges to the binomial distribution.
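These properties can be checked numerically. The sketch below computes the mean and variance directly from the hypergeometric probabilities, compares the variance with the binomial variance n p (1 - p) multiplied by (N - n)/(N - 1), and reports how far the distribution is from the binomial as N grows with n fixed. The function names and the values n = 5, p = 0.3 are illustrative assumptions.

```python
from math import comb

def hypergeom_pmf(k, n, N, n_ones):
    # math.comb returns 0 when k exceeds n_ones, so impossible k get probability 0
    return comb(n_ones, k) * comb(N - n_ones, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def moments(pmf_values):
    """Mean and variance of a distribution on 0, 1, ..., len(pmf_values) - 1."""
    mean = sum(k * pk for k, pk in enumerate(pmf_values))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf_values))
    return mean, var

n, p = 5, 0.3  # assumed sample size and proportion of ones
for N in (10, 100, 10_000):
    n_ones = round(N * p)
    pmf = [hypergeom_pmf(k, n, N, n_ones) for k in range(n + 1)]
    mean, var = moments(pmf)
    predicted_var = n * p * (1 - p) * (N - n) / (N - 1)
    # Largest pointwise difference from the binomial probabilities
    max_gap = max(abs(pmf[k] - binom_pmf(k, n, p)) for k in range(n + 1))
    print(N, mean, var, predicted_var, max_gap)
```

The computed variance should match the predicted value in each row, and the gap to the binomial probabilities should shrink as N increases relative to n.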

### Example

Imagine that you are studying the nesting behavior of puffins on an island, which contains N = 100 suitable nesting cavities. Of these nesting cavities, 30 are on a cliff face that is inaccessible to mammalian predators, while the remainder are on a grassy slope. You watch as the first n = 20 puffins choose cavities and begin nesting, and you observe that k = 11 choose cliff sites. Thus, among the first nesters, you observe a higher proportion (11/20 = 55%) using cliff sites than expected on the basis of the proportion of cliff sites (30/100 = 30%). The hypergeometric distribution can be used to determine the probability of observing exactly k = 11 nesting on the cliff: