## Info


Figure 4.15 The cumulative percentages of data in Fig. 4.14a are plotted here on normal probability paper as a function of the upper limits of classes. Cumulative percentiles are indicated on the right-hand side of the graph. The last data value cannot be plotted on this graph because its cumulative percentage is 100. The diagonal line represents the theoretical cumulative normal distribution with the same mean and variance as the data. This line is positioned on the graph using reference values of the cumulative normal distribution, for example 0.13% at $\bar{y} - 3s$ and 99.87% at $\bar{y} + 3s$, and it passes through the point ($\bar{y}$, 50%). This graph contains exactly the same information as Fig. 4.14b; the difference lies in the scale of the ordinate.

where the $x_i$ are the ordered observations ($x_1 \le x_2 \le \dots \le x_n$) and coefficients $w_i$ are optimal weights for a population assumed to be normally distributed. Statistic W may be viewed as the square of the correlation coefficient (i.e. the coefficient of determination) between the abscissa and ordinate of the normal probability plot described above. Large values of W indicate normality (points lying along a straight line give $r^2$ close to 1), whereas small values indicate lack of normality. Shapiro & Wilk provided critical values of W for sample sizes up to 50. D'Agostino (1971, 1972) and Royston (1982a, b, c) proposed modifications to the W formula (better estimates of the weights $w_i$), which extend its application to much larger sample sizes. Extensive simulation studies have shown that W is a sensitive omnibus test statistic, meaning that it has good power properties over a wide range of non-normal distribution types and sample sizes.
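The reading of W as a squared correlation on the normal probability plot can be illustrated with a short sketch. The code below does not compute the Shapiro-Wilk statistic itself, since the optimal weights are not reproduced here; it only computes the squared correlation between the ordered observations and standard normal quantiles. The plotting positions (i - 0.5)/n, the function name `probability_plot_r2` and the two artificial samples are assumptions made for the illustration.

```python
import math
from statistics import NormalDist, mean

def probability_plot_r2(values):
    """Squared correlation between ordered data and normal quantiles.

    This is NOT the Shapiro-Wilk W (no optimal weights are used); it is
    only the r^2 of the normal probability plot, with assumed plotting
    positions (i - 0.5) / n.
    """
    x = sorted(values)
    n = len(x)
    std_normal = NormalDist()
    q = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, mq = mean(x), mean(q)
    sxq = sum((a - mx) * (b - mq) for a, b in zip(x, q))
    sxx = sum((a - mx) ** 2 for a in x)
    sqq = sum((b - mq) ** 2 for b in q)
    return sxq ** 2 / (sxx * sqq)

# Two artificial samples of size 30 (hypothetical, built from quantiles):
# one lying exactly on a normal curve, one strongly right-skewed.
n = 30
positions = [(i - 0.5) / n for i in range(1, n + 1)]
normal_like = [NormalDist(10.0, 2.0).inv_cdf(p) for p in positions]
skewed = [math.exp(NormalDist().inv_cdf(p)) for p in positions]

r2_normal = probability_plot_r2(normal_like)
r2_skewed = probability_plot_r2(skewed)
print(round(r2_normal, 4), round(r2_skewed, 4))
```

The first sample aligns on a straight line and gives an r² essentially equal to 1, whereas the skewed sample gives a clearly lower value, which is the behaviour described above for W.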

Which of these tests is best? Reviewing the studies on the power of tests of normality published during the past 25 years, D'Agostino (1982) concluded that the best omnibus tests are the Shapiro-Wilk W-test and a modification by Stephens (1974) of the Anderson-Darling $A^2$-test mentioned above. In a recent Monte Carlo study involving autocorrelated data (Section 1.1), however, Dutilleul & Legendre (1992) showed (1) that, for moderate sample sizes, both the D-test and the W-test are too liberal (in an asymmetric way) for high positive ($\rho > 0.4$) and very high negative ($\rho < -0.8$) values of autocorrelation along time series and for high positive values of spatial autocorrelation ($\rho > 0.2$), and (2) that, overall, the Kolmogorov-Smirnov D-test is more robust against autocorrelation than the Shapiro-Wilk W-test, whatever the sign of the first-order autocorrelation.

As stated at the beginning of this section, ecologists need to check the normality of data only when they wish to use parametric statistical tests that are based on the normal distribution. Most methods presented in this book, including clustering and ordination techniques, do not require statistical testing and hence may be applied to non-normal data. With many of these methods, however, ecological structures emerge more clearly when the data do not present strong asymmetry; this is the case, for example, with principal component analysis. Since normal data are not skewed (coefficient $\alpha_3 = 0$), testing the normality of data is also testing for asymmetry; normalizing transformations, applied to data with unimodal distributions, reduce or eliminate asymmetries. So, with multidimensional data, it is recommended to check at least the normality of variables one by one.
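Checking asymmetry variable by variable can be sketched as follows. The skewness is computed here from the simple moment ratio m3 / m2^1.5, which is an assumption: the estimator used elsewhere in the book may include small-sample corrections. The data table, the variable names and the choice of a logarithmic transformation are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (zero for a symmetric sample)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data table: rows are objects, columns are variables
# (an abundance-like variable and a temperature-like variable).
data = [
    [1.0, 10.0], [1.0, 11.0], [2.0, 12.0], [2.0, 13.0], [3.0, 14.0],
    [4.0, 15.0], [6.0, 16.0], [9.0, 17.0], [20.0, 18.0], [55.0, 19.0],
]
abundance, temperature = (list(col) for col in zip(*data))

skew_raw = skewness(abundance)                         # strongly right-skewed
skew_log = skewness([math.log(v) for v in abundance])  # after log transformation
skew_sym = skewness(temperature)                       # symmetric variable

print(f"abundance: {skew_raw:.2f}  log(abundance): {skew_log:.2f}  "
      f"temperature: {skew_sym:.2f}")
```

In this artificial example the logarithmic transformation reduces the asymmetry of the first variable without eliminating it, while the second variable needs no transformation.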

Some tests of significance require that the data be multinormal (Section 4.3). Section 4.6 has shown that the multidimensional normal distribution contains conditional distributions; it also contains marginal distributions, which are distributions on one or several dimensions, collapsing all the other dimensions. The normality of unidimensional marginal distributions, which correspond to the p individual variables in the data set, can easily be tested as described above. In a multivariate situation, however, showing that each variable does not significantly depart from normality does not prove that the multivariate data set is multinormal although, in many instances, this is the best researchers can practically do.
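A classical counter-example, sketched below, shows why normal marginal distributions do not guarantee multinormality. It is not taken from the text: variable y is obtained by multiplying a standard normal variable x by a random sign, so that both x and y are standard normal, yet their sum is exactly zero about half of the time, which is impossible for a bivariate normal pair that is not perfectly correlated. The seed and the sample size are arbitrary.

```python
import random
from statistics import mean, pstdev

random.seed(42)          # arbitrary seed, for reproducibility only
n = 10_000

x = [random.gauss(0.0, 1.0) for _ in range(n)]
signs = [random.choice((-1.0, 1.0)) for _ in range(n)]
y = [s * v for s, v in zip(signs, x)]   # also N(0, 1), by symmetry

# For a bivariate normal pair (not perfectly correlated), x + y would be
# normal and would equal exactly zero with probability 0. Here it is
# zero whenever the sign is -1, i.e. about half of the time.
zero_fraction = sum(1 for a, b in zip(x, y) if a + b == 0.0) / n

print(f"mean(y) = {mean(y):.3f}, sd(y) = {pstdev(y):.3f}, "
      f"fraction of x + y == 0: {zero_fraction:.3f}")
```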

Dagnelie (1975) proposed an elegant and simple way of testing the multinormality of a set of multivariate observations. The method is based on the Mahalanobis generalized distance ($D_5$; Section 7.4, eq. 7.40), which is described in Chapter 7. Generalized distances are computed, in the multidimensional space, between each object and the multidimensional mean of all objects. The distance between object $\mathbf{x}_i$ and the mean point $\bar{\mathbf{x}}$ is computed as:

$$D(\mathbf{x}_i, \bar{\mathbf{x}}) = \sqrt{[\mathbf{y} - \bar{\mathbf{y}}]_i \, \mathbf{S}^{-1} \, [\mathbf{y} - \bar{\mathbf{y}}]_i'} \qquad (4.54)$$

where $[\mathbf{y} - \bar{\mathbf{y}}]_i$ is the vector corresponding to object $\mathbf{x}_i$ in the matrix of centred data and $\mathbf{S}$ is the dispersion matrix (Section 4.1). Dagnelie's approach is that, for multinormal data, the generalized distances should be normally distributed. So, the n generalized distances (corresponding to the n objects) are put in increasing order, after which the relative cumulative frequency of each i-th distance is calculated as (i - 0.5)/n. The data are then plotted on a normal probability scale (Fig. 4.15), with the generalized distances on the abscissa and the relative cumulative frequencies on the ordinate. From visual examination of the plot, one can decide whether the data points are well aligned; if so, the hypothesis of multinormality of the original data may be accepted. Alternatively, the list of generalized distances may be subjected to a Shapiro-Wilk test of normality, whose conclusions are applied to the multinormality of the original multivariate data. With standardized variables $z_{ij} = (y_{ij} - \bar{y}_j)/s_j$, eq. 4.54 becomes:

$$D(\mathbf{x}_i, \bar{\mathbf{x}}) = \sqrt{\mathbf{z}_i \, \mathbf{R}^{-1} \, \mathbf{z}_i'}$$

where R is the correlation matrix.
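The steps of this procedure can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a reference implementation: the data table is hypothetical, the dispersion matrix is estimated with divisor n - 1, and the helper names (`solve`, `generalized_distances`, `standardize`) are introduced only for this example. The sketch computes the generalized distances from the centred data and S (eq. 4.54), checks that the standardized form with R gives the same values, and lists the coordinates of the normal probability plot.

```python
import math
from statistics import NormalDist, mean, stdev

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def generalized_distances(rows):
    """Mahalanobis distance between each object and the centroid."""
    n, p = len(rows), len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    centred = [[v - mu for v, mu in zip(row, means)] for row in rows]
    # Dispersion matrix S, with divisor n - 1 (assumed estimator).
    s = [[sum(r[j] * r[k] for r in centred) / (n - 1) for k in range(p)]
         for j in range(p)]
    # d' S^-1 d is obtained by solving S w = d, then taking d . w
    return [math.sqrt(sum(d * w for d, w in zip(row, solve(s, row))))
            for row in centred]

def standardize(rows):
    """z_ij = (y_ij - mean_j) / s_j for each variable j."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, sds)]
            for row in rows]

# Hypothetical data: 8 objects (rows) described by 3 variables (columns).
rows = [
    [2.1, 8.0, 1.2], [3.4, 7.1, 0.9], [1.9, 9.3, 1.8], [4.2, 6.5, 1.1],
    [2.8, 8.4, 1.5], [3.9, 7.7, 0.7], [2.5, 8.8, 1.6], [3.1, 7.4, 1.3],
]

d_raw = generalized_distances(rows)               # centred data and S
d_std = generalized_distances(standardize(rows))  # S of z-scores is R

# Coordinates of the normal probability plot: ordered distances
# (abscissa) against cumulative frequencies (i - 0.5)/n (ordinate).
n = len(rows)
ordered = sorted(d_raw)
cum_freq = [(i - 0.5) / n for i in range(1, n + 1)]
normal_scores = [NormalDist().inv_cdf(f) for f in cum_freq]
for d, f, q in zip(ordered, cum_freq, normal_scores):
    print(f"{d:6.3f}  {f:6.4f}  {q:7.3f}")
```

The column of normal scores gives the position of each cumulative frequency on a normal probability scale; plotting the ordered distances against these scores and looking for a straight line reproduces the visual examination described above.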