Delphinusdelphis Phocoenaphocoena Stenellafrontalis
Fig. 20.3 Boxplot of age conditional of species. There is considerably less between-species variation compared to between-animal variation indicate that the variation in ages recorded in Scotland is considerably less than Spain, indicating we may need to use different variances per country. There are also three observations with undetermined sex, and to allow for interactions between sex, location, and age determination, we removed these observations.
The following R code was used to access the data, make the two boxplots, and remove the three observations where sex was undetermined. By now, you should be familiar with this code. The object Cetaceans2 contains the male and female data (the class unknown was dropped).
> library(AED); data(Cetaceans)
> Cetaceans$fSpecies <- factor(Cetaceans$Species)
> Cetaceans$fDolphinID <- factor(Cetaceans$DolphinID)
> boxplot(Age ~ fSpecies, data = Cetaceans)
> boxplot(Age ~ fDolphinD), data = Cetaceans)
20.3 Data Analysis
The starting point for the analysis is a model of the form
The variable Age,jk is the age of observation i in animal j of species k. The index k runs from 1 to 6 and i from 1 to 3. The number of animals per species differs. We start discussing the fixed part of the model and then the random part. Recall from Chapters 4 and 5 that the protocol dictates that we start with a model that contains as many fixed explanatory variables as possible. In this case, we have three nominal explanatory variables. We therefore start with a model containing sex, stain, and location as main terms, all two-way interactions, and the three-way interaction. Hence, the fixed part consists of
Sexjk + Stainjk + Locationjk + Sexjk x Stainjk + Sexjk x Locationjk+ Stain,jk x Location,^ + Sexjk x Stainjk x Location^
This model is fitted with the gls function to serve as a reference model. The following code was used for this.
> Cetaceans2$fSex <- factor(Cetaceans2$Sex)
> Cetaceans2$fLocation <- factor(Cetaceans2$Location)
> Cetaceans2$fStain <- factor(Cetaceans2$Stain)
> f1 <- formula(Age ~ fSex * fStain * fLocation)
> M1 <- gls(f1, method = "REML", data = Cetaceans2)
We can now go to step 2 of the analysis. The random effect 'animal' is nested within the random effect 'species'. Just as West et al. (2006), we argue that if the random effect 'animal' is included in the model, then the random effect 'species' should also be included in the model. Making our starting point for the random part, ak + aj\k + Sijk
The term ejk is the unexplained error and represents the within-animal variation. It is assumed to be normally distributed with mean 0 and variance a2. However, the data exploration indicated that there may be different spread per location and we should be prepared at some stage to test whether multiple variances are needed per location. But to avoid too many steps at once, we will wait until we reach steps 3-5 of the analysis before considering this in any detail.
Recall that the index k refers to species k. We assume that ak is normally distributed with mean 0 and variance aspecies2. The term ak is a random intercept and allows for variation between the species. The amount of variation is determined by aspecies2. The term aj|k looks intimidating, but represents the variation between animals (index j) of the same species (index k). We assume it is normally distributed with mean 0 and variance aanimal2. Summarising, ak allows for variation between the species and aj|k for the variation between animals within the same species.
Therefore, our starting model contains a sex, location, and stain effect as well as all their interactions, and we also use random intercepts that model between-species variation and between-animal variation within the species.
As part of this analysis (step 2), the first model comparison is between the model with the two random effects ak and aj|k and a model without them. Recall that these random effects are nested. If the between-animal variation is important, then we should use both random effects. So, we will not test whether the random effect ak on its own is important. The following code applies the model with both random effects and compares the model with and without the random effects using the anova command.
> M2 <- lme(fi, random =~1 | fSpecies / fDolphinID, data = Cetaceans2, method = "REML")
It is important that you define the variables fSpecies and fDolphinID as factors before the lme command or R will give an error message. The output of the anova command is as follows:
Model df AIC BIC logLik Test L.Ratio p-value
Ml 1 13 1101.4488 1141.8261 -537.7244
M2 2 15 740.3277 786.9168 -355.1638 1 vs 2 365.1212 <.0001
The AIC indicates that the model with the two random effects is considerably better. The likelihood ratio statistic is L = 365.12, and the cited p-value indicates that we can reject the null hypothesis H0: aanimal = 0 in favour of the alternative H1: aanimal > 0. However, note that we are testing on the boundary, and therefore, the cited critical p-value should be multiplied with 0.5; see also Chapter 5 or Chapter 4 in West et al. (2006). Even after applying this correction, we still come to the same conclusion that we need the two random intercepts.
We now have two options. We can either apply a model validation, check for homogeneity (especially plotting residuals versus location), or extend the model by allowing for multiple variance based on location and see whether it improves the model. The motivation for the last approach is because the data exploration showed a clear difference in spread per location. Recall from Chapter 4 that adding such a variance structure extends the model to ei]k - N(0, as2)
The index s refers to the two locations, allowing the residuals from the two locations to have a different spread. Based on the data exploration, we decided to include the multiple variance structure and see whether it improved the model. Some might argue that this variance structure should have been used in the starting model, but there are two reasons for not doing this; firstly because we prefer to start as simple as possible and secondly the explanatory variables could have explained the differences in spread. The following R code was used to extend the model.
> M3 <- lme(fi, random =~ 1 | fSpecies / fDolphinID, weights = varIdent(form =~ 1 | fLocation), data = Cetaceans2)
The only new bit of code is the weights option with the varIdent variance structure (see also Chapter 4). The anova command shows that the AIC of this model is 733.01 and the likelihood ratio statistic is L = 9.30 (df = 1, p = 0.002), making this the best model so far. We can now proceed to steps 7-9 of the analysis to find the optimal fixed structure for our selected random structure.
To find the optimal model in terms of the fixed explanatory variables, we can either use the t-statistics, sequential F-tests, or likelihood ratio tests. In this instance, as we have factors with more than two levels (stain), we decided to use the third option. This part of the analysis was described earlier in Chapter 5, and we assume the reader is familiar with the tedious repetitive process of fitting a full model dropping all allowable terms in turn, applying likelihood ratio tests of nested models dropping the least significant term, and repeating the whole process until all terms are significant.
We first fitted a model with all terms (main terms: all two-way interactions and the three-way interaction) and then a model without the three-way interaction. Both models were fitted with maximum likelihood estimation (ML). The likelihood ratio test indicated that we could drop the three-way interaction (L = 5.05, df = 2, p = 0.07). The process then continued by dropping each of the three two-way interaction terms in turn and identified the least significant with the likelihood ratio test. The first interaction term to go out was the sex x location term (L = 0.35, df = 1, p = 0.55), followed by the sex x stain interaction (L = 1.5, df = 2, p = 0.46), and finally, sex as a main term was dropped (L = 0.68, df = 1, p = 0.40) as it was not included in the remaining two-way interaction term. At this point, the fixed part of the model contained stain, location, and the stain x location interaction. Dropping the interaction gave L = 19.14 (df = 2, p < 0.001), giving the optimal model in terms of fixed terms. In words, it is given by
Ageijk = Stainjk + Location^ + Stain,^ x Location^ + ak + aj\k + Sijk
We refitted the model with REML and applied a model validation. There are no problems with homogeneity. The numerical output of the model is obtained with the summary command:
Linear mixed-effects model fit by REML Data: Cetaceans2
AIC BIC logLik 734.7 766.1 -357.4
Random effects: Formula: ~1 | fSpecies (Intercept) StdDev: 1.285
Formula: ~1 | fDolphinID %in% fSpecies (Intercept) Residual StdDev: 5.503 0.6496
Variance function: Structure: Different standard deviations per stratum Formula: ~1 | fLocation Parameter estimates: Scotland Spain 1.000 1.596 Fixed effects: list(f1)
Was this article helpful?