## Info

access the cattle house

| Variable | Description |
| --- | --- |
| Accessible_cattle_house_present | Binary indicator - is such a feed store present? |
| Accessible_feed_present | Binary indicator - is accessible feed present? |
| Grass_silage | Binary indicator of presence of grass silage |
| Cereal_silage | Binary indicator of presence of cereal silage |
| HayStraw | Binary indicator of presence of hay/straw |
| Cereal_grains | Binary indicator of presence of cereal grains |
| Concentrates | Binary indicator of presence of concentrates |
| Proteinblocks | Binary indicator of presence of protein blocks |
| Sugarbeet | Binary indicator of presence of sugar beet |
| Vegetables | Binary indicator of presence of vegetables |
| Molasses | Binary indicator of presence of molasses |

used generalised estimating equations (GEE) and generalised linear mixed models (GLMM). If there were no temporal auto-correlation, generalised linear modelling (GLM) could be applied. The underlying GLM, GEE, and GLMM theory was discussed in Chapters 9, 12, and 13.

The aim of this chapter is not to find the best possible model for the data, but merely to contrast GLM, GEE, and GLMM. When writing this chapter, we considered two ways to do this, namely,

1. Apply a model selection in each of the three models (GLM, GEE, and GLMM). It is likely that the optimal GLM consists of a different set of explanatory variables than the optimal GEE and GLMM. The reason is that the GLM omits the dependence structure in the data. We have seen this behaviour already in various other examples in this book with the Gaussian distribution. Also, recall the California data set that was used to illustrate GLM and GEE in Chapter 12; the p-values of the GLM were considerably smaller than those of the GEE! Therefore, a model selection ends up with different models. Using this approach, the story of the chapter is then that (erroneously) ignoring a dependence structure gives you a different set of significant explanatory variables.

2. Apply the GLM, GEE, and GLMM on the same set of explanatory variables and compare the estimated parameters and p-values. If they are different (especially if the GLM p-values are much smaller), then the message of the chapter is that ignoring the dependence structure in a GLM gives p-values that are too small, and hence overstates the significance of the covariates.

Both approaches are worth presenting, but due to limited space, we decided to go for option 2 and leave the first approach as an exercise for the reader. The question is then: Which GLM should we select? We decided to adopt the role of an ignorant scientist: apply the model selection using the GLM, and contrast the result with the GEE and GLMM applied to the same selection of covariates. Note that the resulting GEE and GLMM are not the optimal models, as we are not following our protocol from Chapters 4 and 5, which stated that we should first look for the optimal random structure using a model that contains as many covariates as possible.
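To make option 2 concrete, the following R sketch fits a GLM, a GEE, and a GLMM with the same covariate and places the estimates, standard errors, and p-values side by side. It is an illustration under stated assumptions only: the data are simulated, the names `farm`, `year`, `x`, and `y` are hypothetical, and the AR-1 working correlation and the random intercept are choices made for the example, not necessarily the structures used later in this chapter. It also assumes that the geepack and lme4 packages are installed.

```r
# Sketch only: simulated data with hypothetical names (farm, year, x, y).
library(geepack)  # geeglm()
library(lme4)     # glmer()

set.seed(1)
n_farm <- 40
n_year <- 6

# One row per farm and year; rows are sorted by farm and by year within farm,
# which geeglm() needs for the id argument and the AR-1 working correlation.
d <- data.frame(
  farm = rep(seq_len(n_farm), each = n_year),
  year = rep(seq_len(n_year), times = n_farm)
)
d$x <- rnorm(n_farm)[d$farm]             # farm-level covariate
b   <- rnorm(n_farm, sd = 1.5)[d$farm]   # farm effect that induces dependence
d$y <- rbinom(nrow(d), size = 1, prob = plogis(-0.5 + 0.5 * d$x + b))

# The same covariate in all three models
m_glm  <- glm(y ~ x, family = binomial, data = d)
m_gee  <- geeglm(y ~ x, family = binomial, data = d,
                 id = farm, corstr = "ar1")
m_glmm <- glmer(y ~ x + (1 | farm), family = binomial, data = d)

# Slope estimate, standard error, and p-value for x in each model
s_glm  <- coef(summary(m_glm))
s_gee  <- coef(summary(m_gee))
s_glmm <- coef(summary(m_glmm))

cmp <- data.frame(
  model = c("GLM", "GEE", "GLMM"),
  est   = unname(c(s_glm["x", "Estimate"],
                   s_gee["x", "Estimate"],
                   s_glmm["x", "Estimate"])),
  se    = unname(c(s_glm["x", "Std. Error"],
                   s_gee["x", "Std.err"],
                   s_glmm["x", "Std. Error"])),
  p     = unname(c(s_glm["x", "Pr(>|z|)"],
                   s_gee["x", "Pr(>|W|)"],
                   s_glmm["x", "Pr(>|z|)"]))
)
print(cmp)
```

When reading such a table, the comparison of interest is between the GLM row and the other two: a GLM standard error that is clearly smaller than the GEE and GLMM ones is the pattern described in option 2. Keep in mind that the GLMM slope has a conditional (within-farm) interpretation, whereas the GLM and GEE slopes are marginal, so the estimates themselves need not agree.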

### 22.2 Data Exploration

The first problem we encountered was the spreadsheet (containing data on 282 observations), which had many missing values. Most R functions used so far have options to remove missing values automatically. In this section, we will use the geeglm function from the geepack package, which requires that all missing values be removed beforehand.
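As a minimal sketch of how missing values can be counted and removed before calling geeglm, consider the base R code below. The data frame is a made-up toy example; only the column names are borrowed from this chapter, and dropping every incomplete row is just the simplest possible strategy, not necessarily the one that suits the real data.

```r
# Sketch only: toy data; the column names mimic the chapter, the values are invented.
d <- data.frame(
  Signs_in_yard = c(0, 1, NA, 0, 1, 0),
  Year          = c(1, 1, 1, 2, 2, 2),
  Grass_silage  = c(1, NA, 0, 1, NA, 0),
  Molasses      = c(0, 0, 1, NA, 0, 1)
)

# Number of missing values per variable
n_na <- colSums(is.na(d))
print(n_na)

# Step 1: drop the rows in which the response variable is missing
d1 <- d[!is.na(d$Signs_in_yard), ]

# Step 2: drop every remaining row with a missing value in any column
d2 <- na.omit(d1)

# Number of rows left after each step
n_rows <- c(original = nrow(d), response_known = nrow(d1), complete = nrow(d2))
print(n_rows)
```

Comparing the row counts after each step shows how quickly a complete-case approach can shrink a data set when the missing values are spread over different rows.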

Rows with missing values in the response variable were first removed. Some of the explanatory variables had no missing values at all and other explanatory variables had 71 missing values! Removing every row (observation) that contains a

Table 22.2 Number of missing values per variable. The data set contains 288 rows (observations). The notation '# NAs' stands for the number of missing values. The response variable is Signs_in_yard and contains 6 missing values

Variable

# NAs

Variable

# NAs

Year