
Using the formulae for the Euclidean ($D_1$, eq. 7.34) and $\chi^2$ ($D_{16}$, eq. 7.54) distances, one can verify that the Euclidean distances among the rows of matrix $\hat{\mathbf{F}}$ are equal to the $\chi^2$ distances among the columns of the original data table (Table 9.11):

Matrix $\hat{\mathbf{F}}$ thus provides a proper ordination of the columns of the original data matrix (species abundance classes in the numerical example).
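The equality of the two sets of distances can be checked numerically. The sketch below is only an illustration: it assumes the usual computation of correspondence analysis by singular value decomposition of the matrix of contributions to chi-square, it uses a hypothetical 3 x 3 table (not the data of Table 9.11), and the names `Q_bar`, `F_hat`, `euclidean` and `chi2_distance_among_columns` are chosen here for convenience.

```python
import numpy as np

# Hypothetical contingency table, used for illustration only.
Y = np.array([[12., 4., 1.],
              [6., 9., 5.],
              [2., 7., 14.]])

P = Y / Y.sum()                       # relative frequencies p_ij
r, c = P.sum(axis=1), P.sum(axis=0)   # row weights p_i+ and column weights p_+j

# Contributions to chi-square, then singular value decomposition.
Q_bar = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U_hat, s, Ut = np.linalg.svd(Q_bar, full_matrices=False)

# Column scores scaled so that chi-square distances among columns are preserved.
F_hat = (Ut.T / np.sqrt(c)[:, None]) * s

def euclidean(X):
    """Euclidean distances among the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def chi2_distance_among_columns(Y):
    """Chi-square distances among the columns of the frequency table Y."""
    profiles = (Y / Y.sum(axis=0)).T      # one column profile y_ij / y_+j per row
    weights = Y.sum() / Y.sum(axis=1)     # y_++ / y_i+
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((weights * diff ** 2).sum(axis=2))

D_euclid = euclidean(F_hat)
D_chi2 = chi2_distance_among_columns(Y)
print(np.allclose(D_euclid, D_chi2))
```

All axes are retained in `F_hat`; the equality holds only for the full set of axes, whereas a reduced-space plot approximates these distances.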

3 — Interpretation

The relationship between matrices $\mathbf{V}$ and $\hat{\mathbf{V}}$, which provide the ordinations of the columns and rows of the contingency (or species data) table, respectively, is found by combining eqs. 9.38, 9.41, and 9.42 in the following expression:

This equation means that the ordination of the rows (matrix $\hat{\mathbf{V}}$) is related to the ordination of the columns (matrix $\mathbf{V}$), along principal axis $h$, by the value $\sqrt{\lambda_h}$, which is a measure of the "correlation" between these two ordinations. Value $(1 - \lambda_h)$ actually measures the difficulty of ordering, along principal axis $h$, the rows of the contingency table from an ordination of the columns, or the converse (Orloci, 1978). The highest eigenvalue (0.096 in the above numerical example), or its square root ($\sqrt{\lambda_1} = 0.31$), is consequently a measure of the dependence between two unordered descriptors, to be added to the measures described in Chapter 6. Williams (1952) discusses different methods for testing the significance of $R^2 = \lambda$.
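The interpretation of $\sqrt{\lambda_h}$ as a correlation can be illustrated numerically. The following sketch assumes that the row and column scores are obtained from the singular value decomposition of the matrix of contributions to chi-square and scaled to unit weighted variance; the table is hypothetical and the variable names are chosen here for illustration. Because both sets of scores have zero weighted mean and unit weighted variance, the sum of $p_{ij}\,\hat{v}_{ih}\,v_{jh}$ over all cells is their correlation along axis $h$.

```python
import numpy as np

# Hypothetical contingency table, used for illustration only.
Y = np.array([[12., 4., 1.],
              [6., 9., 5.],
              [2., 7., 14.]])

P = Y / Y.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
Q_bar = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U_hat, s, Ut = np.linalg.svd(Q_bar, full_matrices=False)

k = min(Y.shape) - 1                          # non-trivial axes (assumes a full-rank table)
V_hat = U_hat[:, :k] / np.sqrt(r)[:, None]    # row scores
V = Ut.T[:, :k] / np.sqrt(c)[:, None]         # column scores
eigenvalues = s[:k] ** 2

# Correlation between row and column scores over the cells of the table,
# each cell (i, j) being weighted by its relative frequency p_ij.
corr = np.array([np.sum(P * np.outer(V_hat[:, h], V[:, h])) for h in range(k)])

print(np.allclose(corr, np.sqrt(eigenvalues)))
```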

Examination of joint plots (e.g. Fig. 9.16) allows one to draw conclusions about the ecological relationships displayed by the data. With scaling type 1, (a) the distances among rows (or sites in the case of a species × sites data table) in reduced space approximate their $\chi^2$ distances and (b) the rows (sites) are at the centroids of the columns (species). Positions of the centroids are calculated using weights equal to the relative frequencies of the columns (species); columns (species) that are absent from a row (site) have null weights and do not contribute to the position of that row (site). Thus, the ordination of rows (sites) is meaningful. In addition, any row (site) found near the point representing a column (species) is likely to have a high contribution of that column (species); for binary (or species presence-absence) data, the row (site) is more likely to possess the state of that column (or contain that species).

With scaling type 2, it is the distances among columns (species) in reduced space that approximate their $\chi^2$ distances, whereas columns (species) are at the centroids of the rows (sites). Consequently (a), the ordination of columns (species) is meaningful, and (b) any column (species) that lies close to the point representing a row (site) is more likely to be found in the state of that row (site), or with higher frequency (abundance) than in rows (sites) that are further away in the joint plot.
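The centroid properties of the two scalings can be checked on a small example. This is a minimal sketch under stated assumptions: scores are computed by singular value decomposition of the matrix of contributions to chi-square, scaling type 1 is taken to plot $\mathbf{F}$ (rows) with $\mathbf{V}$ (columns), scaling type 2 is taken to plot $\hat{\mathbf{F}}$ (columns) with $\hat{\mathbf{V}}$ (rows), and the site × species table is hypothetical.

```python
import numpy as np

# Hypothetical site x species abundance table (4 sites, 3 species).
Y = np.array([[8., 2., 0.],
              [5., 5., 1.],
              [1., 6., 4.],
              [0., 3., 9.]])

P = Y / Y.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
Q_bar = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U_hat, s, Ut = np.linalg.svd(Q_bar, full_matrices=False)

k = min(Y.shape) - 1                          # non-trivial axes (assumes a full-rank table)
V_hat = U_hat[:, :k] / np.sqrt(r)[:, None]    # row (site) scores, unit weighted variance
V = Ut.T[:, :k] / np.sqrt(c)[:, None]         # column (species) scores, unit weighted variance
F = V_hat * s[:k]                             # rows, chi-square distances preserved
F_hat = V * s[:k]                             # columns, chi-square distances preserved

# Scaling type 1: each site is the average of the species scores V,
# weighted by the relative abundances of the species at that site.
row_centroids = (P / r[:, None]) @ V

# Scaling type 2: each species is the average of the site scores V_hat,
# weighted by the relative frequencies of that species across sites.
col_centroids = (P.T / c[:, None]) @ V_hat

print(np.allclose(F, row_centroids), np.allclose(F_hat, col_centroids))
```

A species absent from a site has $p_{ij} = 0$ and therefore contributes nothing to the weighted average that positions that site.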

For species presence-absence or abundance data, insofar as a species has a unimodal (i.e. bell-shaped) response curve along the axes of ecological variation corresponding to the ordination axes, the optimum for that species should be close to the point representing it in the ordination diagram and its frequency of occurrence or abundance should decrease with distance from that point. Species that are absent at most sites often appear at the edge of the scatter plot, near the point representing a site where they happen to be present — by chance, or because they are favoured by some rare condition occurring at that site. Such species have little influence on the analysis because their numerical contributions are small (column sums in Table 9.11). Finally, species that lie near the centre of the ordination diagram may have their optimum in this area of the plot, or have two or several optima (bi- or multi-modal species), or else be unrelated to the pair of ordination axes under consideration. Species of this last group may express themselves along some other axis or axes. Close examination of the raw data table may be required in this case. It is the species found away from the centre of the diagram, but not near the edges, that are the most likely to display clear relationships with the ordination axes (ter Braak, 1987c).

In Fig. 9.16 (a and b), the first CA axis (70.1% of the variance) orders the abundances in a direction opposite to that of temperatures. Both graphs associate abundance (0) to the highest temperature (3), abundance (+) to the intermediate temperature (2), and abundance (++) to the lowest temperature (1). An analysis of the correspondence between rows and columns of the contingency table following the methods described in Section 6.4 would have shown the same relationships.

4 — Site x species data tables

Correspondence analysis has been applied to data tables other than contingency tables. Justification is provided by Benzécri and coll. (1973). Notice, however, that the elements of a table to be analysed by correspondence analysis must be dimensionally homogeneous (i.e. same physical units, so that they can be added) and non-negative ($\geq 0$, so that they can be transformed into probabilities or proportions). Several data sets already have these characteristics, such as (bio)mass values, concentrations, financial data (in \$, £, etc.), or species abundances.

Other types of data may be recoded to make the descriptors dimensionally homogeneous and positive; the most widely used data transformations are discussed in Section 1.5. For descriptors with different physical units, the data may, for example, be standardized (which makes them dimensionless; eq. 1.12) and made positive by translation, i.e. by subtracting the most negative value; or divided by the maximum or by the range of values (eqs. 1.10 and 1.11). Data may also be recoded into ordered classes. Regardless of the method, recoding is a critical step of correspondence analysis. Consult Benzécri and coll. (1973) on this matter.
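Two of the recodings mentioned above can be sketched as follows. The data are hypothetical, the translation is applied separately to each descriptor (an assumption; the text does not say whether the translation is per descriptor or global), and ranging is written as (value - minimum) / range.

```python
import numpy as np

# Hypothetical table: 5 objects described by 3 descriptors in different units.
X = np.array([[2.1, 350., -1.5],
              [3.4, 420., 0.2],
              [1.8, 390., 2.7],
              [4.0, 310., -0.8],
              [2.9, 455., 1.1]])

# Standardization makes each descriptor dimensionless (mean 0, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Translation: subtract the most negative value of each descriptor,
# so that the smallest value becomes 0 and no value is negative.
Z_translated = Z - Z.min(axis=0)

# Ranging: rescale each descriptor to the interval [0, 1].
X_ranged = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(Z_translated.min(axis=0), X_ranged.max(axis=0))
```

Both results are dimensionless and non-negative, which is what correspondence analysis requires of its input.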


Several authors, mentioned at the beginning of this Section, have applied correspondence analysis to the analysis of site × species matrices containing species presence/absence or abundance data. This generalization of the method is based on the following sampling model. If sampling had been designed in such a way as to collect individual organisms (which is usually not the case, the sampled elements being, most often, sampling sites), each organism could be described by two descriptors: the site where it was collected and the taxon to which it belongs. These two descriptors may be written out to an inflated data table which has as many rows as there are individual organisms. The more familiar site × species data table would then be the contingency table resulting from crossing the two descriptors, sites and taxa. It could be analysed using any of the methods applicable to contingency tables. Most methods involving tests of statistical significance cannot be used, however, because the hypothesis of independence of the individual organisms, following the model described above, is not met by species presence-absence or abundance data collected at sampling sites.
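The sampling model can be made concrete with a small sketch. Starting from a hypothetical site × species table of counts, it writes out the inflated data table (one row per individual organism, with its site and its taxon) and then crosses the two descriptors to recover the original table. Variable names are illustrative.

```python
import numpy as np

# Hypothetical site x species table of counts (2 sites, 3 species).
Y = np.array([[3, 0, 2],
              [1, 4, 0]])

# Inflated data table: one row per organism, columns = (site index, species index).
site_idx, species_idx = np.nonzero(Y)
counts = Y[site_idx, species_idx]
inflated = np.column_stack([np.repeat(site_idx, counts),
                            np.repeat(species_idx, counts)])

# Crossing the two descriptors gives back the contingency table.
table = np.zeros_like(Y)
np.add.at(table, (inflated[:, 0], inflated[:, 1]), 1)

print(inflated.shape)
print(np.array_equal(table, Y))
```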

Niche theory tells us that species have ecological preferences, meaning that they are found at sites where they encounter favourable conditions. This statement is rooted in the idea that species have unimodal distributions along environmental variables (Fig. 9.12), more individuals being found near some environmental value which is "optimal" for the given species. This has been formalised by Hutchinson (1957) in his fundamental niche model. Furthermore, Gause's (1935) competitive exclusion principle suggests that, in their micro-evolution, species should have developed non-overlapping niches. These two principles indicate together that species should be roughly equally spaced in the n-dimensional space of resources. This model has been used by ter Braak (1985) to justify the use of correspondence analysis on presence-absence or abundance data tables; he showed that the $\chi^2$ distance preserved through correspondence analysis (Table 9.1) is an appropriate model for species with unimodal distributions along environmental gradients.

Let us follow the path travelled by Hill (1973b), who rediscovered correspondence analysis while exploring the analysis of vegetation variation along environmental gradients; he called his method "reciprocal averaging" before realizing that this was correspondence analysis (Hill, 1974). Hill started from the simpler method of gradient analysis, proposed by Whittaker (1960, 1967) to analyse site × species data tables. Gradient analysis uses a matrix $\mathbf{Y}$ (site × species) and an initial vector $\mathbf{v}$ of values $v_j$ which are ascribed to the various species $j$ as indicators of the physical gradient to be evidenced. For example, a score (scale from 1 to 10) could be given to each species for its preference with respect to soil moisture. These coefficients are used to calculate the positions of the sites along the gradient. The score $\hat{v}_i$ of a site $i$ is calculated as the average score of the species ($j = 1 \ldots p$) present at that site, using the formula:

$$\hat{v}_i = \frac{\sum_{j=1}^{p} y_{ij}\, v_j}{y_{i+}}$$

where $y_{ij}$ is the abundance of species $j$ at site $i$ and $y_{i+}$ is the sum of the organisms at this site (i.e. the sum of values in row $i$ of matrix $\mathbf{Y}$).

Gradient analysis produces a vector $\hat{\mathbf{v}}$ of the positions of the sites along the gradient under study. Hill (1973b, 1974) suggested continuing the analysis, now using vector $\hat{\mathbf{v}}$ of the ordination of sites to compute a new ordination ($\mathbf{v}$) of the species:

$$v_j = \frac{\sum_{i=1}^{n} y_{ij}\, \hat{v}_i}{y_{+j}}$$

in which $y_{+j}$ is the sum of values in column $j$ of matrix $\mathbf{Y}$. Alternating between $\mathbf{v}$ and $\hat{\mathbf{v}}$ (scaling the vectors at each step as shown in step 6 of Table 9.12) defines an iterative procedure that Hill (1973b) called "reciprocal averaging". This procedure converges towards a unique unidimensional ordination of the species and sites, which is independent of the values initially given to the $v_j$'s; different initial guesses as to the values $v_j$ may however change the number of steps required to reach convergence. Being aware of the work of Clint & Jennings (1970), Hill realized that he had discovered an eigenvalue method for gradient analysis, hence the title of his 1973b paper. It so happens that Hill's method produces the barycentred vectors $\mathbf{v}$ and $\hat{\mathbf{v}}$ for species and sites, which correspond to the first eigenvalue of a correspondence analysis. Hill (1973b) showed how to calculate the eigenvalue ($\lambda$) corresponding to these ordinations and how to find the other eigenvalues and eigenvectors. He thus created a simple algorithm for correspondence analysis (described in Subsection 7).
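The alternation between site and species scores can be sketched in a few lines. This is not a transcription of Table 9.12: the scaling step used here (centring the species scores on their weighted mean and dividing by their weighted standard deviation, with weights equal to the relative column sums) is an assumption made for the illustration, as are the function name `reciprocal_averaging`, its parameters and the data. The eigenvalue is estimated as the shrinkage of the species scores over one full cycle and is compared with the first eigenvalue obtained by singular value decomposition.

```python
import numpy as np

def reciprocal_averaging(Y, max_iter=1000, tol=1e-14):
    """First correspondence analysis axis by reciprocal averaging (sketch)."""
    Y = np.asarray(Y, dtype=float)
    row_sums, col_sums = Y.sum(axis=1), Y.sum(axis=0)
    c = col_sums / Y.sum()                    # species weights p_+j
    v = np.arange(1.0, Y.shape[1] + 1.0)      # arbitrary initial species scores
    v = v - c @ v                             # centre on the weighted mean
    v = v / np.sqrt(c @ v ** 2)               # scale to unit weighted variance
    lam = 0.0
    for _ in range(max_iter):
        v_hat = (Y @ v) / row_sums            # site scores: averages of species scores
        v_new = (Y.T @ v_hat) / col_sums      # species scores: averages of site scores
        v_new = v_new - c @ v_new
        lam_new = np.sqrt(c @ v_new ** 2)     # shrinkage over one cycle
        v = v_new / lam_new
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    v_hat = (Y @ v) / row_sums                # sites at the centroids of the species
    return lam, v, v_hat

# Hypothetical site x species table with a clear gradient (5 sites, 4 species).
Y = np.array([[10., 5., 0., 0.],
              [4., 8., 3., 0.],
              [0., 3., 9., 2.],
              [0., 0., 4., 12.],
              [1., 0., 2., 7.]])

lam_ra, v, v_hat = reciprocal_averaging(Y)

# Reference value: first eigenvalue from the singular value decomposition.
P = Y / Y.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
Q_bar = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
lam_svd = np.linalg.svd(Q_bar, compute_uv=False)[0] ** 2

print(np.isclose(lam_ra, lam_svd))
```

Changing the initial vector `v` (provided it is not constant across species) should change only the number of iterations, not the solution, apart from a possible reversal of sign of the scores.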

When interpreting the results of correspondence analysis, one should keep in mind that the simultaneous ordination of species and sites aims at determining how useful the ordination of species is, as a whole, for predicting the ordination of the sites. In other words, it seeks the predictive value of one ordination with respect to the other. Subsection 3 has shown that, for any given dimension $h$, $(1 - \lambda_h)$ measures the difficulty of ordering, along principal axis $h$, the row states of the contingency table from an ordination of the column states, or the converse. The interpretation of the relationship between the two ordinations must be done with reference to this statistic.

When it is used as an ordination method, correspondence analysis provides an ordination of the sites which is somewhat similar to that resulting from a principal component analysis of the correlation matrix among species (standardized data). This is to be expected since the first step in the calculation actually consists in weighting each datum by the sums (or the relative frequencies) of the corresponding row and column (eqs. 9.32 and 9.33), which eliminates the effects due to the large variances that certain rows or columns may have. In the case of steep gradients (i.e. many zeros in the data matrix), correspondence analysis should produce a better ordination than PCA (Hill, 1973b). This was also shown by Gauch et al. (1977) using simulated and