# Descriptors other than species abundances

Why the resemblance between species abundance descriptors must be measured using special coefficients is explained at the beginning of the next Subsection. Measures of resemblance in the present Subsection are used for comparing descriptors for which double-zeros provide unequivocal information (for a discussion of double-zeros in ecology, see the beginning of Section 7.3).

The resemblance between quantitative descriptors can be computed using parametric measures of dependence, i.e. measures based on parameters of the frequency distributions of the descriptors. These measures are the covariance and the Pearson correlation coefficient; they have been described in Chapter 5. They are suited only to descriptors whose relationships are linear.

The covariance $s_{jk}$ between descriptors j and k is computed from centred variables $(y_{ij} - \bar{y}_j)$ and $(y_{ik} - \bar{y}_k)$ (eq. 4.4). The range of values of the covariance has no a priori upper or lower limits. The variances and covariances among a group of descriptors form their dispersion matrix S (eq. 4.6).

Pearson's correlation coefficient $r_{jk}$ is the covariance of descriptors j and k computed from standardized variables (eqs. 1.12 and 4.7). The correlation coefficients among a group of descriptors form the correlation matrix R (eq. 4.8). Correlation coefficients range in value between -1 and +1. The significance of individual coefficients (the null hypothesis being generally H0: r = 0) is tested using eq. 4.13, whereas eq. 4.14 is used to test the complete independence among all descriptors.
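The relationship between the two parametric measures can be illustrated with a minimal Python sketch. It assumes the usual sample estimators (division by n - 1); the data values and the function names `covariance`, `standardize` and `pearson_r` are hypothetical and chosen only for illustration.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def covariance(yj, yk):
    """Sample covariance of two descriptors (assumes n - 1 in the denominator)."""
    mj, mk = mean(yj), mean(yk)
    return sum((a - mj) * (b - mk) for a, b in zip(yj, yk)) / (len(yj) - 1)

def standardize(y):
    """Centre on the mean, then divide by the sample standard deviation."""
    m = mean(y)
    s = sqrt(covariance(y, y))
    return [(v - m) / s for v in y]

def pearson_r(yj, yk):
    """Pearson's r obtained as the covariance of the standardized descriptors."""
    return covariance(standardize(yj), standardize(yk))

# Hypothetical data: two quantitative descriptors measured on five objects.
temperature = [12.0, 14.5, 15.1, 18.3, 21.0]
depth = [30.0, 24.0, 25.0, 15.0, 9.0]

s_jk = covariance(temperature, depth)   # unbounded, carries physical units
r_jk = pearson_r(temperature, depth)    # dimensionless, between -1 and +1

# Same coefficient computed as the covariance divided by the two standard deviations.
r_direct = s_jk / sqrt(covariance(temperature, temperature)
                       * covariance(depth, depth))

print(s_jk, r_jk, r_direct)
```

The two ways of obtaining r agree up to rounding error, whereas the covariance changes whenever the measurement units of either descriptor change.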

Some authors have used Pearson's r for Q-mode analyses, after interchanging the positions of objects and descriptors in the data matrix. Lefebvre (1980) calls this Q measure the resemblance coefficient. There are at least five objections to this:

• In the R mode, Pearson's r is a dimensionless coefficient (Chapter 3). When the descriptors are not dimensionally homogeneous, the Q-mode correlation coefficient, which combines all descriptors, has complex dimensions that cannot be interpreted.

• In most cases, one may arbitrarily rescale quantitative descriptors (e.g. multiplying one by 100 and dividing another by 10). In the R mode, the value of r remains unchanged after rescaling, whereas doing so in the Q mode may change the value of resemblance between objects in an unpredictable and nonmonotonic fashion.

• In order to avoid the two previous problems, it has been suggested to standardize the descriptors (eq. 1.12) before computing correlations in the Q mode. Consider two objects x1 and x2: their similarity should be independent of the other objects in the study; removing objects from the data set should not change it. Any change in object composition of the data set changes the standardized variables, however, and so it affects the value of the correlation computed between x1 and x2. Hence, standardization does not solve the problems.

• Even with dimensionally homogeneous data (e.g. counts of different species), the second objection still holds. In addition, in the R mode, the central limit theorem (Section 4.3) predicts that, as the number of objects increases, the means, variances, and covariances (or correlations) converge towards their values in the statistical population. In the Q mode, on the contrary, adding new descriptors (their positions have been interchanged with those of objects in the data matrix) causes major variations in the resemblance coefficient if these additional descriptors are not perfectly correlated with those already present.

• If correlation coefficients could be used as a general measure of resemblance in the Q mode, they should be applicable in particular to the simple case of the description of the proximities among sites, computed from their geographic coordinates X and Y on a map; the correlations obtained from this calculation should reflect in some way the distances among the sites. This is not the case: correlation coefficients computed among sites from their geographic coordinates are all +1 or -1. As an exercise, readers are encouraged to compute an example of their own.
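Two of the objections listed above (the effect of rescaling, and the case of geographic coordinates) can be checked numerically. The following Python sketch is only an illustration: the data matrix, the site coordinates and the helper names are hypothetical.

```python
from math import sqrt

def pearson_r(u, v):
    """Pearson's r between two equally long sequences of numbers."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def column(matrix, j):
    return [row[j] for row in matrix]

# Hypothetical data matrix: rows are objects, columns are descriptors.
data = [
    [1.0, 5.0, 3.0],
    [2.0, 1.0, 3.0],
    [4.0, 2.0, 2.0],
    [3.0, 6.0, 1.0],
]
# Arbitrary rescaling: first descriptor multiplied by 100, second divided by 10.
rescaled = [[row[0] * 100, row[1] / 10, row[2]] for row in data]

# R mode: correlation between descriptors 1 and 2 (columns), before and after.
r_mode_before = pearson_r(column(data, 0), column(data, 1))
r_mode_after = pearson_r(column(rescaled, 0), column(rescaled, 1))

# Q mode: "correlation" between objects 1 and 2 (rows), before and after.
q_mode_before = pearson_r(data[0], data[1])
q_mode_after = pearson_r(rescaled[0], rescaled[1])

print("R mode:", r_mode_before, r_mode_after)   # unchanged by rescaling
print("Q mode:", q_mode_before, q_mode_after)   # changes, here even in sign

# Geographic coordinates (X, Y) of four hypothetical sites.
sites = [(2.0, 5.0), (7.0, 1.0), (3.0, 9.0), (8.0, 8.5)]
site_correlations = [
    pearson_r(sites[p], sites[q])
    for p in range(len(sites))
    for q in range(p + 1, len(sites))
]
print(site_correlations)   # every value is +1 or -1, whatever the distances
```

With only two coordinates per site, each Q-mode correlation is computed from two points, so it can only be +1 or -1 (or undefined when a site has X = Y); it carries no information about the distances among the sites.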

It follows that the measures designed for R-mode analysis should not be used in the Q mode. Sections 7.3 and 7.4 describe several Q-mode coefficients, whose properties and dimensions are already known or easy to determine.

The resemblance between semiquantitative descriptors and, more generally, between any pair of ordered descriptors whose relationship is monotonic, may be determined using nonparametric measures of dependence (Chapter 5). Since quantitative descriptors are ordered, nonparametric coefficients may be used to measure their dependence, as long as they are monotonically related.

Two nonparametric correlation coefficients have been described in Section 5.3: Spearman's r and Kendall's τ (tau). In Spearman's r (eq. 5.3), quantitative values are replaced by ranks before applying Pearson's r formula. Kendall's τ (eqs. 5.5 to 5.7) measures the resemblance in a way that is quite different from Pearson's r. Values of Spearman's r and Kendall's τ range between -1 and +1. The significance of individual coefficients (the null hypothesis being generally H0: r = 0) is tested using eq. 5.4 (Spearman's r) or 5.8 (Kendall's τ).
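As a rough illustration of the two rank coefficients, the Python sketch below computes Spearman's r as Pearson's r on ranks, and Kendall's τ in its simplest form (concordant minus discordant pairs, divided by the number of pairs). The simple form assumes data without ties and is not necessarily the variant given by each of the equations cited above; the data and function names are hypothetical.

```python
from math import sqrt

def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_r(u, v):
    """Spearman's r: Pearson's formula applied to the ranks."""
    return pearson_r(ranks(u), ranks(v))

def kendall_tau(u, v):
    """Kendall's tau, simple form assuming no ties in u or v."""
    n = len(u)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (u[i] - u[j]) * (v[i] - v[j])
            s += (prod > 0) - (prod < 0)   # +1 if concordant, -1 if discordant
    return s / (n * (n - 1) / 2)

# Hypothetical descriptors: y is a monotonic but nonlinear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 3 for v in x]
z = [2.0, 1.0, 3.0, 4.0, 5.0, 6.0]   # same ordering as x except one inversion

print(pearson_r(x, y), spearman_r(x, y), kendall_tau(x, y))
print(spearman_r(x, z), kendall_tau(x, z))
```

For the monotonic nonlinear pair, both rank coefficients reach their maximum while Pearson's r stays below 1, which is why rank coefficients are preferred when the relationship is monotonic but not linear.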

As with Pearson's r above, rank correlation coefficients should not be used in the Q mode. Indeed, even if quantitative descriptors are standardized, the same problem arises as with Pearson's r, i.e. the Q measure for a pair of objects is a function of all objects in the data set. In addition, in most biological sampling units, several species are represented by small numbers of individuals. Because these small numbers are subject to large stochastic variation, the ranks of the corresponding species are uncertain in the reference ecosystem. As a consequence, rank correlations between sites would be subject to important random variation because their values would be based on large numbers of uncertain ranks. This is equivalent to giving preponderant weight to the many poorly sampled species.

The importance of qualitative descriptors in ecological research is discussed in Section 6.0. The measurement of resemblance between pairs of such descriptors is based on two-way contingency tables (Sections 6.2 and 6.3), whose analysis is generally conducted using X² (chi-square) statistics. Contingency table analysis is also the major approach available for measuring the dependence between quantitative or semiquantitative ordered descriptors that are not monotonically related. The minimum value of X² is zero, but it has no a priori upper limit. Its formulae (eqs. 6.5 and 6.6) and test of significance are explained in Section 6.2. X² may be transformed into contingency coefficients (eqs. 6.19 and 6.20), whose values range between 0 and +1.
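A small Python sketch may help fix ideas. It computes the Pearson and Wilks (likelihood-ratio) forms of the chi-square statistic for a two-way contingency table, together with one widely used contingency coefficient, sqrt(X² / (n + X²)). Whether these expressions correspond exactly to the equations cited above is an assumption; the tables and function names are hypothetical.

```python
from math import log, sqrt

def expected_frequencies(table):
    """Expected cell frequencies under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def wilks_chi_square(table):
    """Likelihood-ratio form: 2 * sum of observed * ln(observed / expected)."""
    expected = expected_frequencies(table)
    return 2 * sum(o * log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)

def contingency_coefficient(table):
    """sqrt(X2 / (n + X2)); always below 1 (assumed form of the coefficient)."""
    x2 = pearson_chi_square(table)
    n = sum(sum(row) for row in table)
    return sqrt(x2 / (n + x2))

# Hypothetical tables crossing two qualitative descriptors with two states each.
associated = [[20, 5], [5, 20]]
independent = [[10, 20], [20, 40]]

print(pearson_chi_square(associated), wilks_chi_square(associated),
      contingency_coefficient(associated))
print(pearson_chi_square(independent), contingency_coefficient(independent))
```

The statistic is zero when the observed frequencies equal those expected under independence and grows without a fixed upper bound as n increases, whereas the coefficient stays between 0 and 1.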

Two-way contingency tables may also be analysed using measurements derived from information theory. In this case, the amounts of information (B) shared by two descriptors j and k and exclusive to each one (A and C) are first computed. These quantities may be combined into similarity measures, such as S(j, k) = B/(A + B + C) (eq. 6.15; see also eqs. 6.17 and 6.18), or into distance coefficients such as D(j, k) = (A + C)/(A + B + C) (eq. 6.16). The analysis of multiway contingency tables (Section 6.3) is based on the Wilks X² statistic (eq. 6.6).
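The quantities A, B and C can be sketched in Python as follows. The sketch assumes the usual Shannon-entropy definitions, with B = H(j) + H(k) - H(j,k), A = H(j,k) - H(k) and C = H(j,k) - H(j), so that A + B + C = H(j,k); these definitions, the example tables and the function names are assumptions made for illustration.

```python
from math import log

def entropy(counts):
    """Shannon entropy (natural logarithms) of a list of frequencies."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def information_components(table):
    """Return (A, B, C) for a two-way table: rows = states of j, columns = states of k."""
    h_j = entropy([sum(row) for row in table])            # H(j), row margins
    h_k = entropy([sum(col) for col in zip(*table)])      # H(k), column margins
    h_jk = entropy([c for row in table for c in row])     # joint entropy H(j,k)
    b = h_j + h_k - h_jk      # information shared by j and k
    a = h_jk - h_k            # information exclusive to j
    c = h_jk - h_j            # information exclusive to k
    return a, b, c

def similarity(table):
    a, b, c = information_components(table)
    return b / (a + b + c)

def distance(table):
    a, b, c = information_components(table)
    return (a + c) / (a + b + c)

# Hypothetical tables: perfect association, and independence.
perfect = [[25, 0], [0, 25]]
independent = [[10, 20], [20, 40]]

print(similarity(perfect), distance(perfect))
print(similarity(independent), distance(independent))
```

Under these definitions the similarity and the distance sum to 1, the similarity being 1 for perfectly associated descriptors and 0 (up to rounding) for independent ones.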

A qualitative descriptor (including a classification; Chapter 8) can be compared to a quantitative descriptor using one-way analysis of variance (one-way Anova; Table 5.2 and accompanying text). The classification criterion for this Anova is the qualitative descriptor. As long as the assumptions underlying analysis of variance are met (i.e. normality of within-group distributions and homoscedasticity, Box 1.4), the significance of the relationship between the descriptors may be tested. If the quantitative descriptor does not obey these assumptions or the comparison is between a quantitative and a semiquantitative descriptor, nonparametric one-way analysis of variance (Kruskal-Wallis H test; Table 5.2) is used instead of parametric Anova.
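The two test statistics mentioned in this paragraph can be sketched in Python as follows. The sketch computes only the statistics (F and H), not their probabilities, and the Kruskal-Wallis function assumes there are no tied values, so no tie correction is applied. The data (a quantitative descriptor split by three states of a qualitative descriptor) and the function names are hypothetical.

```python
def anova_f(groups):
    """One-way Anova F statistic: between-group over within-group mean square."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic; assumes all values are distinct (no ties)."""
    values = [v for g in groups for v in g]
    n = len(values)
    rank_of = {v: i + 1 for i, v in enumerate(sorted(values))}
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: a quantitative descriptor measured in three sediment types,
# the sediment type being the qualitative descriptor (classification criterion).
sand = [1.1, 1.4, 1.2]
silt = [2.3, 2.0, 2.6]
clay = [3.5, 3.1, 3.9]

f_stat = anova_f([sand, silt, clay])
h_stat = kruskal_wallis_h([sand, silt, clay])
print(f_stat, h_stat)
```

To test significance, F would be compared to an F distribution with (k - 1) and (n - k) degrees of freedom, and H to tabulated critical values or, for large enough groups, to a chi-square distribution with (k - 1) degrees of freedom.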