## Cleveland Dotplot To Assess For Issues With Outliers In Explanatory Variables

Fig. 2.1 A: Cleveland dotplot for Nereis concentration. B: Conditional Cleveland dotplot of Nereis concentration conditional on nutrient with values 1, 2 and 3. Different symbols were used, and the graph suggests violation of homogeneity. The x-axes show the value at a particular observation, and the y-axes show the observations.

In a dotchart, the first row in the text file is plotted as the lowest value along the y-axis in Fig. 2.1A, the second observation as the second lowest, etc. The x-axis shows the value of the concentration for each observation. By itself, this graph is not that spectacular, but extending it by making use of the grouping option in dotchart (for further details type: ?dotchart in R) makes it considerably more useful, as can be seen from Fig. 2.1B. This figure was produced using the following command:

> dotchart(Nereis$concentration, groups = factor(Nereis$nutrient), ylab = "Nutrient", xlab = "Concentration", main = "Cleveland dotplot", pch = Nereis$nutrient)

The groups = factor(Nereis$nutrient) argument ensures that observations from the same nutrient are grouped together, and the pch argument sets the point character, that is, the plotting symbol. In this case, the nutrient levels are labelled 1, 2 and 3, so the column can be passed to pch directly. If other characters are required, or if nutrient is labelled with alphanumeric values, you have to make a new column with the required values. Figuring out which number corresponds to a particular symbol is a matter of trial and error or of looking it up in a table; see, for example, Venables and Ripley (2002).
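As a minimal sketch of the "new column" step described above, the following code maps alphanumeric nutrient labels to plotting symbol codes through a named lookup vector. The data frame `toy`, its labels "A", "B" and "C", and all of its values are hypothetical and made up for illustration; they are not the Nereis data.

```r
# Hypothetical toy data (made-up values, not the Nereis data).
toy <- data.frame(
  concentration = c(0.5, 1.2, 0.9, 2.8, 3.1, 2.2, 1.0, 1.1, 0.9),
  nutrient      = rep(c("A", "B", "C"), each = 3),
  stringsAsFactors = FALSE
)

# Named lookup vector: one pch code per nutrient label.
# 16 = filled circle, 17 = filled triangle, 15 = filled square.
symbol_codes <- c(A = 16, B = 17, C = 15)

# New column holding the symbol code for every observation.
toy$symbol <- unname(symbol_codes[as.character(toy$nutrient)])

# Pass the new column to pch, mirroring the command above.
dotchart(toy$concentration,
         groups = factor(toy$nutrient),
         xlab = "Concentration",
         ylab = "Nutrient",
         main = "Cleveland dotplot",
         pch = toy$symbol)
```

Indexing the lookup vector by label keeps the label-to-symbol mapping in one place, which is easier to check than a hand-typed column.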

Cleveland dotplots are useful to detect outliers and violation of homogeneity. Homogeneity means that the spread of the data values is the same for all variables, and if this assumption is violated, we call this heterogeneity. Points at the far ends of the horizontal axis (extremely large or extremely small values) may be considered outliers. Whether such points are influential in the statistical analysis depends on the technique used and the relationship between the response and explanatory variables. In this case, there are no extremely large or small values for the variable concentration. The Cleveland dotplot in Fig. 2.1B indicates that we may expect problems with violation of homogeneity in a linear regression model applied to these data, as the spread in the third nutrient is considerably smaller than that in the other two. The mean concentration value of nutrient two seems to be larger, indicating that in a regression model, the covariate nutrient will probably play an important role.
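The visual impression of spread and mean per group can be backed up with a few summary statistics. The sketch below is one possible check, not part of the original analysis: it uses the standard deviation as the measure of spread, which is an assumption, and a hypothetical data frame `toy` whose made-up values were chosen to mimic the pattern described above (small spread in group 3, larger mean in group 2).

```r
# Hypothetical toy data (made-up values, not the Nereis data).
toy <- data.frame(
  concentration = c(1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 1.9, 2.0, 2.1),
  nutrient      = rep(1:3, each = 3)
)

# Spread per nutrient level; clearly unequal values hint at heterogeneity.
spread_by_nutrient <- tapply(toy$concentration, toy$nutrient, sd)

# Mean per nutrient level; large differences suggest nutrient matters.
mean_by_nutrient <- tapply(toy$concentration, toy$nutrient, mean)

# Overall range, as a quick look for extremely small or large values.
concentration_range <- range(toy$concentration)

print(spread_by_nutrient)
print(mean_by_nutrient)
print(concentration_range)
```

Such numbers complement the dotplot rather than replace it: the plot shows where individual observations fall, which a single summary per group cannot.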

## 2.1.2 Pairplots

Another essential data exploration tool is the pairplot obtained by the R command

> pairs(Nereis)

The resulting graph is presented in Fig. 2.2. Each panel is a scatterplot of two variables. The graph does not show any obvious relationships between concentration and biomass, but there seems to be a clear relationship between concentration and nutrient.