Causes and Solutions for Overdispersion

Hilbe (2007) distinguishes between apparent and real overdispersion. Apparent overdispersion is due to missing covariates or interactions, outliers in the response variable, non-linear effects of covariates entered as linear terms in the systematic part of the model, or choice of the wrong link function. These are mainly model misspecifications. Hilbe (2007, pp. 52-61) gives a couple of interesting examples. For example, he simulates a Poisson variable using five explanatory variables X1 to X5, applies a Poisson model using only explanatory variables X2 to X4, and shows how this causes overdispersion. Similar examples are given for the effects of outliers and of using the wrong link function.
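The missing-covariate effect can be mimicked with a small simulation. The sketch below is not Hilbe's example: it assumes a single covariate x, hypothetical coefficients 0.5 and 0.8, and uses the true means as a stand-in for a correctly specified model instead of fitting one. The Pearson chi-square divided by the residual degrees of freedom is used as the dispersion measure, which is a common choice and an assumption here.

```python
import math
import random


def rpois(mu, rng):
    """Draw one Poisson(mu) count with Knuth's method (adequate for small means)."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


rng = random.Random(1)
n = 2000

# Hypothetical data-generating model: log(mu) = 0.5 + 0.8 * x
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
true_mu = [math.exp(0.5 + 0.8 * xi) for xi in x]
y = [rpois(m, rng) for m in true_mu]

# Stand-in for the correctly specified model: the true means, no parameters estimated.
phi_full = pearson_dispersion(y, true_mu, n_params=0)

# Misspecified model that omits x: intercept only, so every fitted mean is the sample mean.
ybar = sum(y) / n
phi_missing = pearson_dispersion(y, [ybar] * n, n_params=1)

print("dispersion with the covariate:   ", round(phi_full, 2))
print("dispersion without the covariate:", round(phi_missing, 2))
```

The counts are genuinely Poisson given x, so the first value should be close to 1. The second value is well above 1 only because the covariate was left out, which is what makes this overdispersion apparent and not real.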

Real overdispersion exists when we cannot identify any of the previously mentioned causes. This can be because the variance of the data really is larger than the mean. Or there may be many zeros (which may, or may not, cause overdispersion), clustering of observations, or correlation between observations.

If adding covariates and interactions does not help, there is a quick fix that can be tried before considering more complicated methods like the negative binomial GLM.

9.7.3 Quick Fix: Dealing with Overdispersion in a Poisson GLM

We can deal with overdispersion in the GLM by using a quasi-Poisson GLM, which consists of the following steps:

1. The mean and variance of Yi are given by E(Yi) = μi and var(Yi) = φ × μi, where φ is the dispersion parameter.

2. The systematic part is given by η(Xi1, ..., Xiq) = α + β1 × Xi1 + ... + βq × Xiq.

3. There is a logarithmic link between the mean of Yi and the predictor function η(Xi1, ..., Xiq).

The difference between the Poisson GLM and the Poisson GLM with overdispersion is that we no longer explicitly specify a Poisson distribution, but only a relationship between the mean and variance of Yi.

Although we do not specify a Poisson distribution, we still use the same type of model structure in terms of the link function and predictor function. If the dispersion parameter φ = 1, we get the same results (in terms of estimated parameters and standard errors) as the Poisson GLM.
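The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the data are simulated with hypothetical coefficients (1.0 and 0.5) and extra gamma-distributed noise, the model has an intercept and one covariate, the fit uses iteratively reweighted least squares, and the dispersion parameter (written φ here) is estimated as the Pearson chi-square divided by the residual degrees of freedom. That estimator is an assumption in this sketch; the text defers its own definition to Section 9.8.

```python
import math
import random


def rpois(mu, rng):
    """Draw one Poisson(mu) count with Knuth's method (adequate for small means)."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def weighted_sums(x, y, b0, b1):
    """Entries of X'WX and X'Wz for a log-link Poisson working model."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi            # predictor function
        mu = math.exp(eta)            # log link: the working weight equals mu
        z = eta + (yi - mu) / mu      # working response
        s0 += mu
        s1 += mu * xi
        s2 += mu * xi * xi
        t0 += mu * z
        t1 += mu * xi * z
    return s0, s1, s2, t0, t1


def fit_poisson(x, y, n_iter=25):
    """Return intercept, slope, and the Poisson standard error of the slope."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        s0, s1, s2, t0, t1 = weighted_sums(x, y, b0, b1)
        det = s0 * s2 - s1 * s1
        b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det
    # Standard error from the inverse of X'WX at the final estimates.
    s0, s1, s2, _, _ = weighted_sums(x, y, b0, b1)
    se_b1 = math.sqrt(s0 / (s0 * s2 - s1 * s1))
    return b0, b1, se_b1


rng = random.Random(42)
n = 500
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Hypothetical overdispersed counts: the Poisson mean is multiplied by gamma noise with mean 1.
y = [rpois(math.exp(1.0 + 0.5 * xi) * rng.gammavariate(2.0, 0.5), rng) for xi in x]

b0, b1, se_poisson = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]

# Pearson estimate of the dispersion parameter (two estimated regression parameters).
phi = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, fitted)) / (n - 2)

# Quasi-Poisson: same estimates, standard errors scaled by sqrt(phi).
se_quasi = se_poisson * math.sqrt(phi)

print("slope:", round(b1, 3))
print("phi:", round(phi, 2))
print("SE (Poisson):", round(se_poisson, 4), " SE (quasi-Poisson):", round(se_quasi, 4))
```

The estimated regression parameters are identical under both models; only the standard errors change, and with φ equal to 1 the two sets of standard errors would coincide.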

If φ > 1, we talk about overdispersion, and if φ < 1, we have underdispersion. The latter means that the variance of the response variable is smaller than you would expect from a Poisson distribution. Possible reasons for underdispersion are that the model is fitting a couple of outliers rather too well, or that there are too many explanatory variables or interactions in the model (overfitting). If this is not the case, then the consensus is not to correct for underdispersion. Models that take underdispersion into account are discussed in Chapter 7 of Hilbe (2007).

If φ > 1, we need to correct for the overdispersion, which basically means refitting the model, estimating the parameter φ, and 'making some corrections'. Before addressing these corrections, we look at the following questions first:

1. How do we estimate the dispersion parameter φ?

2. How much larger than 1 should it be before we need to make a correction?

3. What is the effect of introducing a dispersion parameter φ?

4. At which point do we decide to take an alternative approach?

The first question can only be answered in detail towards the end of Section 9.8 because the estimation of φ is based on residuals and we have not yet defined residuals for a GLM. The second question can only be answered in light of the third question. The price we pay for introducing a dispersion parameter φ is that the standard errors of the parameters are multiplied by the square root of φ. For example, if φ is equal to 9, then all standard errors are multiplied by 3, and the parameters become less significant. If the parameters of a Poisson GLM are highly significant, then a small correction of the standard errors due to overdispersion, say φ = 1.5, is not going to make any difference to the biological conclusions. But if you have a parameter with a p-value of 0.03, then multiplying the standard error by the square root of 1.5 may change the p-value into something that is no longer significant at the 5% level. So, it all depends. In general, a φ larger than 1.5 means that some action needs to be taken to correct it. Various tests for overdispersion are discussed in Hilbe (2007). For the fourth question, if φ is larger than 15 or 20, then you also need to consider other methods, such as the negative binomial GLM (Section 9.10) or the models for zero-inflated data (Chapter 11).
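The p-value arithmetic in this paragraph can be checked with a short calculation. The sketch below assumes a two-sided Wald z-test under a normal approximation, which is a simplification: software that fits quasi-Poisson models may use t or F tests instead. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def inflated_p_value(p_poisson, phi):
    """Two-sided p-value after the standard error is multiplied by sqrt(phi).

    Assumes a Wald z-test: the estimate is unchanged, so the z-statistic
    shrinks by a factor of sqrt(phi).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - p_poisson / 2.0)   # z-statistic implied by the Poisson p-value
    z_adjusted = z / math.sqrt(phi)                 # larger standard error, smaller z
    return 2.0 * (1.0 - std_normal.cdf(z_adjusted))


# Standard-error multiplier for a dispersion of 9.
se_multiplier = math.sqrt(9.0)

# A parameter with p = 0.03 in the Poisson GLM, corrected for a dispersion of 1.5.
p_adjusted = inflated_p_value(0.03, 1.5)

print("SE multiplier for dispersion 9:", se_multiplier)
print("adjusted p-value:", round(p_adjusted, 3))
```

Under these assumptions the adjusted p-value rises above 0.05, so a parameter that was significant at the 5% level in the Poisson GLM no longer is after the correction.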
