Figure 10.4 Indirect and direct comparison approaches for analysing and interpreting the structure of ecological data. Single thin arrow: inference of structure. Double arrow: interpretation strategy.

The interpretation of a structure, using the descriptors from which it originates, makes it possible to identify which descriptors mainly account for the structuring of the objects. In some methods of ordination (e.g. principal component analysis, correspondence analysis), the eigenvectors readily identify the important descriptors. Other types of ordination, or the clustering techniques, do not directly provide this information, which must therefore be found a posteriori using methods of indirect comparison. This type of interpretation does not allow one to perform formal tests of significance. The reason is that the structure under study is derived from the very same descriptors that are now used to interpret it; it is thus not independent of them.

Interpretation of a structure using external information (data table X in Fig. 10.4) is central to numerical ecology. This approach is used, for example, to diagnose abiotic conditions (response data table Y) from the available biological descriptors (explanatory data table X) or, alternatively, to forecast the responses of species assemblages (table Y) using available environmental descriptors (table X). In the same way, it is possible to compare two groups of biological descriptors, or two tables of environmental data. Until the mid-1980s, the indirect comparison scheme was favoured because of methodological problems with the classical technique of canonical correlations, which was then the only one available in computer packages to analyse two sets of descriptors. With the availability of new computer programs and methods, the direct comparison scheme is becoming increasingly popular in the ecological literature.

In the indirect comparison approach, the first set of descriptors is reduced to a single or a few one-dimensional variables (i.e. a partition resulting from clustering, or one or several ordination axes, the latter being generally interpreted one at a time). It follows that the methods of interpretation for univariate descriptors may also be used for indirect comparisons. This is the approach used in Tables 10.1 and 10.2.

techniques of matrix comparison (Section 10.5). One may also directly compare dendrograms derived from resemblance matrices, using consensus indices. Two main approaches have been developed to test the significance of consensus statistics: (1) a probability distribution derived for a given consensus statistic may be used, or (2) a specific test may be carried out to assess the significance of the consensus statistic, in which the reference distribution is found by permuting the two dendrograms under study in some appropriate way (Lapointe & Legendre, 1995). Readers are referred to the papers of Day (1983, 1986), Shao & Rohlf (1983), Shao & Sokal (1986), Lapointe & Legendre (1990, 1991, 1992a, 1992b, 1995), and Steel & Penny (1993), where these methods are described. Lapointe & Legendre (1994) used the three forms of direct comparison analysis (i.e. comparison of raw data, distance matrices, and dendrograms; Fig. 10.4) on five data sets describing the same objects. They showed that all methods essentially led to similar conclusions, with minor differences.

1 — Explaining ecological structures

Table 10.1 summarizes the methods available for explaining the structure of one or several ecological descriptors. The purpose here is data exploration, not hypothesis testing. The first dichotomy of the Table separates methods for univariate descriptors (used also in the indirect comparison approach) from those for multivariate data.

Methods used for explaining the structure of univariate descriptors belong to three major groups: (1) measures of dependence, (2) discriminant functions, and (3) methods for qualitative descriptors. Methods used for explaining the structure of multivariate descriptors belong to two major types: (4) canonical analysis methods and (5) matrix comparison methods. The following paragraphs briefly review these five groups of methods, paying special attention to those that are not discussed elsewhere in this book.

1. Various coefficients have been described in Chapters 4 and 5 to measure the dependence between two descriptors exhibiting monotonic relationships (i.e. the parametric and nonparametric correlation coefficients). When there are more than two descriptors, one may use the coefficients of partial correlation or the coefficient of concordance (Section 5.3). The coefficient of multiple correlation (R²), which is derived from multiple regression (multiple linear regression and dummy variable regression), may be used when the response descriptor is quantitative. Dummy variable regression is the same as multiple regression, but conducted on explanatory variables that are qualitative or of mixed levels of precision; the qualitative variables are coded as dummy variables, as explained in Subsection 1.5.7. Finally, in logistic regression, it is possible to compute partial correlation coefficients between the response and each of the explanatory variables. These different types of regression are briefly discussed in Subsection 2, in relation with Table 10.2, and in more detail in Section 10.3.

2. Explaining the structure of a qualitative descriptor is often called discrimination, when the aim of the analysis is to identify explanatory descriptors that would allow one to discriminate among the various states of the qualitative descriptor. Discriminant analysis may be used when (1) the explanatory (or discriminant) descriptors are quantitative, (2) their distributions are not too far from normal, and (3) the within-group dispersion matrices are reasonably homogeneous. Discriminant analysis is described in Section 11.5. Its use with species data is discussed in Section 11.6, where alternative strategies are proposed.

3. When both the descriptor to be explained and the explanatory descriptors are qualitative, one may use multidimensional contingency table analysis. It is then imperative to follow the rules, given in Section 6.3, concerning the models to use when a distinction is made between the explained and explanatory descriptors. When the response variable is binary, logistic regression may be a better choice than multidimensional contingency table analysis. An additional advantage is that logistic regression allows one to use explanatory variables presenting a mixture of precision levels. For qualitative variables, the equivalent of discriminant analysis is called discrete discriminant analysis. Goldstein & Dillon (1978) describe models used for this analysis and provide Fortran programs.

Table 10.1 Numerical methods for explaining the structure of descriptors, using either the descriptors from which the structure originates, or other, potentially explanatory descriptors. In parentheses, identification of the Section where a method is discussed. Tests of significance cannot be performed when the structure of a descriptor is explained by the descriptors at the origin of that structure.

1) Explanation of the structure of a single descriptor, or indirect comparison: see 2
   2) Structure of a quantitative or a semiquantitative descriptor: see 3
      3) Explanatory descriptors are quantitative or semiquantitative: see 4
         4) To measure the dependence between descriptors: see 5
            5) Pairs of descriptors: Pearson r, for quantitative descriptors exhibiting linear relationships (4.2); Kendall τ or Spearman r, for quantitative or semiquantitative descriptors exhibiting monotonic relationships (5.2)
            5) A single quantitative descriptor as a function of several others: coefficient of multiple determination R² (4.5)
            5) Several descriptors exhibiting monotonic relationships: coefficient of concordance W (5.2)
         4) To interpret the structure of a single descriptor: partial Pearson r, for quantitative descriptors exhibiting linear relationships (4.5); partial Kendall τ, for descriptors exhibiting monotonic relationships (5.2)
      3) Explanatory descriptors of mixed precision: R² of dummy variable regression (10.3)
      3) Estimation of the dependence between descriptors of the sites and descriptors of the species (any precision level): the 4th-corner method (10.6)
   2) Structure of a qualitative descriptor (or of a classification): see 6
      6) Explanatory descriptors are quantitative: discriminant analysis (11.5)
      6) Explanatory descriptors are qualitative: multidimensional contingency table analysis (6.3); discrete discriminant analysis (10.2)
      6) Explanatory descriptors are of mixed precision: logistic regression (in most computer programs, the explained descriptor is binary; 10.3)
1) Explanation of the structure of a multivariate data table: see 7
   7) Direct comparison: see 8
      8) Structure of quantitative descriptors explained by quantitative descriptors: redundancy analysis (variables in linear relationships; 11.1); canonical correspondence analysis (species data, unimodal distributions; 11.2)
      8) The response and the explanatory data tables are transformed into resemblance matrices, using S or D functions appropriate to their mathematical types: matrix comparison (Mantel test, Procrustes analysis: 10.5)
      8) Classifications are computed for the two data tables: see 9
         9) Partitions are compared: contingency table analysis (6.2), or modified Rand index (8.11)
         9) Dendrograms are compared (10.2, Fig. 10.4)
   7) Indirect comparison: see 10
      10) Ordination in reduced space: each axis is treated in the same way as a single quantitative descriptor: see 2
      10) Clustering: each partition is treated as a qualitative descriptor: see 2

4. The standard approach for comparing two sets of descriptors is canonical analysis (Chapter 11). The classical method in parametric statistics is canonical correlation analysis (Section 11.4), which may be thought of as two principal component analyses, one on each of the two sets, followed by rotation of the principal axes so as to make them correspond, i.e. maximizing their correlations. Canonical correlations are restricted to quantitative descriptors where the relationships between the two data sets are linear; they may also include binary descriptors, just as in multiple regression. There are two problems with this method in the context of the explanation of ecological structures. (1) The solution of canonical correlations, even when mathematically valid, may not necessarily lead to interesting results because the highest correlations may well be found between axes which are of minor importance for the two data sets. It may be simpler to conduct a principal component analysis that includes both sets of descriptors, whose results would be easier to interpret than those of a canonical correlation analysis. Ecological application 9.1a is an example of this approach. (2) In most instances in ecology, one is not interested so much in correlating two data sets as in explaining one using the other. In other words, the questions to be answered focus on one of the two sets, which is thought of as the response, or dependent data set, while the other is the explanatory, or independent data table. The solution to these two problems is found in an indirect comparison approach, where one asks how much of the structure of the response data set is explained by the explanatory data table. Two variants of canonical analysis are now available to do so: redundancy analysis and canonical correspondence analysis (Sections 11.1 and 11.2). The main difference between the two methods is the same as between principal component and correspondence analyses (Table 9.1).

5. Raw data tables may be turned into similarity or distance matrices (Fig. 10.4) when one wishes to express the relationships among objects through a specific measure of resemblance, or because the descriptors are of mixed types; similarity coefficients are available to handle mixed-type data (S15, S16, S19, S20, Chapter 7). Two resemblance matrices concerning the same objects may be compared using matrix correlation (Subsection 8.11.2), that is, by computing a parametric or nonparametric correlation coefficient between corresponding values in these two matrices (excluding the main diagonals). Furthermore, when the two resemblance matrices are independent of each other, i.e. they originate from different data sets, the matrix correlation may be tested for significance using the Mantel test (Section 10.5). In the same way, classifications of objects may be computed from resemblance matrices (Fig. 10.4); two classifications may be compared using appropriate techniques. (1) If one is concerned with specific partitions resulting from hierarchical classifications, or if a non-hierarchical method of classification has been used, one may compare two partitions using contingency table analysis, since partitions are equivalent to qualitative descriptors, or the modified Rand index (Subsection 8.11.2). (2) If one is interested in the relationships depicted by whole dendrograms, cophenetic matrices corresponding to the two dendrograms may be compared and tested for significance using the methods mentioned in the paragraphs where Fig. 10.4 is described. An interesting application of these methods is the comparison of a dendrogram computed from data to a dendrogram taken from the literature.

6. Consider a (site × species) table containing presence-absence data, for which supplementary variables are known for the sites (e.g. habitat characteristics) and for the species (e.g. biological or behavioural traits). The 4th-corner method, described in Section 10.6, offers a way to estimate the dependence between the supplementary variables of the rows and those of the columns, and to test the resulting correlation-like statistics for significance.
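The matrix correlation and its permutation test, mentioned in paragraph 5 above, may be sketched as follows in Python with NumPy. This is only an illustrative sketch, not the procedure of Section 10.5 in full: the function names `mantel_test` and `euclidean_distances`, the use of a Pearson correlation between the upper-triangular values, the one-tailed (upper-tail) test, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def euclidean_distances(x):
    """Euclidean distance matrix among the rows (objects) of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel_test(d1, d2, n_perm=999, seed=0):
    """Matrix correlation between two resemblance matrices describing the
    same objects, with a one-tailed permutation test (assumed design)."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)  # upper triangle, main diagonal excluded

    def matrix_corr(a, b):
        return np.corrcoef(a[iu], b[iu])[0, 1]

    r_obs = matrix_corr(d1, d2)
    rng = np.random.default_rng(seed)
    count = 1  # the observed value is counted in the reference distribution
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute the objects, i.e. rows and columns of d1 together.
        if matrix_corr(d1[np.ix_(p, p)], d2) >= r_obs:
            count += 1
    return r_obs, count / (n_perm + 1)

# Simulated example: 12 sites, two independent sets of descriptors.
rng = np.random.default_rng(42)
species = rng.random((12, 5))
environment = rng.random((12, 3))
r, p_value = mantel_test(euclidean_distances(species),
                         euclidean_distances(environment))
print(f"matrix correlation = {r:.3f}, permutational p = {p_value:.3f}")
```

The objects, not the individual distances, are permuted because the values within a resemblance matrix are not independent of one another.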

2 — Forecasting ecological structures

It is useful to recall here the distinction between forecasting and prediction in ecology. Forecasting models extend, into the future or to different situations, structural relationships among descriptors that have been quantified for a given data set. A set of relationships among variables, which simply describe the changes in one or several descriptors in response to changes in others as computed from a "training set", make up a forecasting model. In contrast, when the relationships are assumed to be causal and to describe a process, the model is predictive. A condition for successful forecasting is that the values of all important variables that have not been observed (or controlled, in the case of an experiment) be about the same in the new situation as they were during the survey or experiment. In addition, forecasting does not allow extrapolation beyond the observed range of the explanatory variables. Forecasting models (also called correlative models) are frequently used in ecology, where they are sometimes misleadingly called "predictive models". Forecasting models are useful, provided that the above conditions are fulfilled. In contrast, predictive models describe known or assumed causal relationships. They allow one to estimate the effects, on some variables, of changes in other variables; they will be briefly discussed at the beginning of the next Subsection.
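The restriction on extrapolation can be made concrete with a small sketch. It assumes a simple linear forecasting model fitted to a training set; the names `fit_linear` and `forecast` are hypothetical, and returning a missing value (NaN) outside the observed range is just one possible way of enforcing the condition.

```python
import numpy as np

def fit_linear(x_train, y_train):
    """Fit y = intercept + slope * x and remember the observed range of x."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope, intercept, x_train.min(), x_train.max()

def forecast(model, x_new):
    """Forecast y for new x values; values of x outside the range observed
    in the training set yield NaN instead of an extrapolated forecast."""
    slope, intercept, x_min, x_max = model
    x_new = np.asarray(x_new, dtype=float)
    inside = (x_new >= x_min) & (x_new <= x_max)
    return np.where(inside, intercept + slope * x_new, np.nan)

# Training set: a response observed for explanatory values between 1 and 4.
model = fit_linear([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
print(forecast(model, [2.5, 10.0]))  # the second value lies outside the range
```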

Methods in Table 10.2 are used to forecast descriptors. As in Table 10.1, the first dichotomy in the Table distinguishes the methods that allow one to forecast a single descriptor (response or dependent variable) from those that may be used to simultaneously forecast several descriptors. Forecasting methods belong to four major groups: (1) regression models, (2) identification functions, (3) canonical analysis methods, and (4) matrix comparison methods.

Table 10.2 Numerical methods to forecast one or several descriptors (response or dependent variables) using other descriptors (explanatory or independent variables). In parentheses, identification of the Section where a method is discussed.

1) Forecasting the structure of a single descriptor, or indirect comparison: see 2
   2) The response variable is quantitative: see 3
      3) The explanatory variables are quantitative: see 4
         4) Null or low correlations among explanatory variables: multiple linear regression (10.3); nonlinear regression (10.3)
         4) High correlations among explanatory variables (collinearity): ridge regression (10.3); regression on principal components (10.3)
      3) The explanatory variables are of mixed precision: dummy variable regression (10.3)
   2) The response variable is qualitative (or a classification): see 5
      5) Response: two or more groups; explanatory variables are quantitative (but qualitative variables may be recoded into dummy variables): identification functions in discriminant analysis (11.5)
      5) Response: binary (presence-absence); explanatory variables are quantitative (but qualitative variables may be recoded into dummy variables): logistic regression (10.3)
   2) The response and explanatory variables are quantitative, but they display a nonlinear relationship: nonlinear regression (10.3)
1) Forecasting the structure of a multivariate data table: see 6
   6) Direct comparison: see 7
      7) The response as well as the explanatory variables are quantitative: redundancy analysis (variables linearly related; 11.1); canonical correspondence analysis (species presence-absence or abundance data; unimodal distributions; 11.2)
      7) Forecasting a resemblance matrix, or a cophenetic matrix representing a dendrogram, using several other explanatory resemblance matrices: multiple regression on resemblance matrices (10.5)
   6) Indirect comparison: see 8
      8) Ordination in reduced space: each axis is treated in the same way as a single quantitative descriptor: see 2
      8) Clustering: each partition is treated as a qualitative descriptor: see 2

1. Methods belonging to regression models are numerous. Several regression methods include measures of dependence that have already been mentioned in the discussion of Table 10.1: multiple linear regression (the explanatory variables must be quantitative), dummy variable regression (i.e. multiple regression conducted on explanatory variables that are qualitative or of mixed levels of precision; the qualitative variables are then coded as dummy variables, as explained in Subsection 1.5.7), and logistic regression (the explanatory variables may be of mixed levels of precision; the response variable is qualitative; most computer programs are limited to the binary case*). Section 10.3 provides a detailed description of several regression methods.

* In the SAS computer package, the standard procedure for logistic regression is LOGIST. One may also use CATMOD, which makes it possible to forecast a multi-state qualitative descriptor.

2. Identification functions are part of multiple discriminant analysis (Section 11.5), whose discriminant functions were briefly introduced in the previous Subsection. These functions allow the assignment of any object to one of the states of a qualitative descriptor, using the values taken by several quantitative variables (i.e. the explanatory or discriminant descriptors). As already mentioned in the previous Subsection, the distributions of the discriminant descriptors must not be too far from normality, and their within-group dispersion matrices must be reasonably homogeneous (i.e. about the same among groups).

3. Canonical analysis, and especially redundancy analysis and canonical correspondence analysis, which were briefly discussed in the previous Subsection (and in more detail in Sections 11.1 and 11.2), allow one to model a data table from the descriptors of a second data table; these two data tables form the "training set". Using the resulting model, it is possible to forecast the position of any new observation among those of the "training set", e.g. along environmental gradients. The new observation may represent some condition which may occur in the future, or at a different but comparable location.

4. Finally, resemblance (S or D) and cophenetic matrices representing dendrograms may be interpreted in the regression framework, against an array of other resemblance matrices, using multiple regression on resemblance matrices (Subsection 10.5.2). The permutational tests of significance for the regression parameters (R² and partial regression coefficients) are performed in the manner of either the Mantel test or the double-permutation test, depending on the nature of the dependent matrix (an ordinary similarity or distance matrix, or a cophenetic matrix).
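Dummy variable regression, mentioned in paragraph 1 of this Subsection, may be illustrated by a minimal sketch. The function names `dummy_code` and `r_squared`, the coding of a descriptor with k states as k − 1 binary variables (the first state serving as reference), and the small data set are assumptions made for the illustration; Subsection 1.5.7 should be consulted for the coding actually recommended.

```python
import numpy as np

def dummy_code(labels):
    """Code a qualitative descriptor with k states as k - 1 binary dummy
    variables; the first state (in sorted order) is the reference."""
    states = sorted(set(labels))
    return np.array([[1.0 if lab == s else 0.0 for s in states[1:]]
                     for lab in labels])

def r_squared(X, y):
    """Coefficient of multiple determination of the least-squares
    regression of y on the columns of X (an intercept is added)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

# Hypothetical data: one quantitative and one qualitative explanatory variable.
depth = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
substrate = ["sand", "sand", "mud", "mud", "rock", "rock", "sand", "mud"]
abundance = np.array([12.0, 10.0, 4.0, 3.0, 20.0, 18.0, 7.0, 1.0])

X = np.column_stack([depth, dummy_code(substrate)])
print(f"R2 of the dummy variable regression: {r_squared(X, abundance):.3f}")
```

Once the qualitative descriptor is recoded, the computation is that of an ordinary multiple linear regression, which is why the two methods share the same R².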

3 — Ecological prediction

As explained in the Foreword, numerical modelling does not belong to numerical ecology sensu stricto. However, some methods of numerical ecology may be used to analyse causal relationships among a small number of descriptors, thus linking numerical ecology to predictive modelling. Contrary to the forecasting or correlative models (previous Subsection), predictive models allow one to foresee how some variables of interest would be affected by changes in other variables. Prediction is possible when the model is based on causal relationships among descriptors (i.e. not only correlative evidence). Causal relationships are stated as hypotheses (theory) for modelling; they may also be validated through experiments in the laboratory or in the field. In manipulative experiments, one observes the responses of some descriptors to user-determined changes in others, by reference to a control. Besides manipulative experiments, which involve two or more treatments, Hurlbert (1984) recognizes mensurative experiments which involve measurements made at one or more points in space or time and allow one to test hypotheses about patterns in space (Chapter 13) and/or time (Chapter 12). The numerical methods in Table 10.3 allow one to explore a network of causal hypotheses, using the observed relationships among descriptors. The design of experiments and analysis of experimental results are discussed by Mead (1988) who offers a statistically-oriented presentation, and by Underwood (1997) in a book emphasizing ecological experiments.

Table 10.3 Numerical methods for analysing causal relationships among ecological descriptors, with the purpose of predicting one or several descriptors using other descriptors. In parentheses, identification of the Section where a method is discussed. In addition, forecasting methods (Table 10.2) may be used for prediction when there are reasons to believe that the relationships between explanatory and response variables are of causal nature.

1) The causal relationships among descriptors are given by hypothesis: see 2
   2) Quantitative descriptors; linear causal relationships: causal modelling using correlations (4.5); path analysis (10.4)
   2) Qualitative descriptors: logit and log-linear models (6.3)
   2) Modelling from resemblance matrices: causal modelling on resemblance matrices (10.5)
1) Hidden variables (latent variables, factors) are assumed to cause the observed structure of the descriptors: confirmatory factor analysis (9.5)

One may hypothesize that there exist causal relationships among the observed descriptors or, alternatively, that the observed descriptors are caused by underlying hidden variables. Depending on the hypothesis, the methods for analysing causal relationships are not the same (Table 10.3). Methods appropriate to the first case belong to the family of path analysis; the second case leads to confirmatory factor analysis. The present Chapter only discusses the former since the latter was explained in Section 9.5. In addition to these methods, techniques of forecasting (Table 10.2) may be used for predictive purposes when there are reasons to believe that the relationships between explanatory and response variables are of causal nature.

Fundamentals of path analysis are presented in Section 10.4. Path analysis is an extension of multiple linear regression and is thus limited to quantitative or binary descriptors (including qualitative descriptors recoded as dummy variables: Subsection 1.5.7). In summary, path analysis is used to decompose and interpret the relationships among a small number of descriptors, assuming (a) a (weak) causal order among descriptors, and (b) that the relationships among descriptors are causally closed. Causal order means, for example, that y2 possibly (but not necessarily) affects y3 but that, under no circumstance, y3 would affect y2 through the same process. Double causal "arrows" are allowed in a model only if different mechanisms may be hypothesized for the reciprocal relationships. Using this assumption, it is possible to set a causal order between y2 and y3. The assumption of causal closure implies independence of the residual causalities, which are the unknown factors responsible for the residual variance (i.e. the variance not accounted for by the observed descriptors). Path analysis is restricted to a small number of descriptors. This is not due to computational problems, but to the fact that the interpretation becomes complex when the number of descriptors in a model becomes large.

When the analysis involves three descriptors only, the simple method of causal modelling using correlations may be used (Subsection 4.5.5). For three resemblance matrices, causal modelling may be carried out using the results of Mantel and partial Mantel tests, as described in Subsection 10.5.2 and Section 13.6.
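Causal modelling with three descriptors relies on comparing simple and partial correlations. The sketch below computes a first-order partial correlation from the three simple correlations, using the standard formula; the function name `partial_corr` and the numerical values are hypothetical. The comment about a causal chain states a commonly used expectation, given here as an illustration and not as a summary of Subsection 4.5.5.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation between descriptors a and b,
    controlling for descriptor c:
    r_ab.c = (r_ab - r_ac * r_bc) / sqrt((1 - r_ac^2) * (1 - r_bc^2))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1.0 - r_ac ** 2) * (1.0 - r_bc ** 2))

# Hypothetical simple correlations among three descriptors y1, y2 and y3.
r12, r23, r13 = 0.8, 0.5, 0.4

# Under a causal chain y1 -> y2 -> y3, one expects r13 to be close to
# r12 * r23, so that the partial correlation r13.2 is close to zero.
print(f"r13.2 = {partial_corr(r13, r12, r23):.3f}")
print(f"r12.3 = {partial_corr(r12, r13, r23):.3f}")
```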

For qualitative descriptors, Fienberg (1980; his Chapter 7) explains how to use logit or log-linear models (Section 6.3) to determine the signs of causal relationships among such descriptors, by reference to diagrams similar to the path diagrams of Section 10.4.

10.3 Regression

The purpose of regression analysis is to describe the relationship between a dependent (or response) random* variable (y) and a set of independent (or explanatory) variables, in order to forecast or predict the values of y for given values of the independent variables x1, x2, ..., xp. Box 1.1 gives the terminology used to refer to the dependent and independent variables of a regression model in an empirical or causal framework. The explanatory variables may be either random* or controlled (and, consequently, known a priori). On the contrary, the response variable must of necessity be a random variable. That the explanatory variables be random or controlled will be important when choosing the appropriate computation method (model I or II).

A mathematical model is simply a mathematical formulation (algebraic, in the case of regression models) of a relationship, or set of relationships among variables, whose parameters have to be estimated, or that are to be tested; in other words, it is a simplified mathematical description of a real-life system. Regression, with its many variants, is the first type of modelling method presented in this Chapter for analysing ecological structures. It is also used as a platform to help introduce the principles of structure analysis. The same principles will apply to more advanced forms, collectively referred to as canonical analysis, that are discussed in Chapter 11.

Regression modelling may be used for description, inference, or forecasting/prediction:

1. Description aims at finding the best functional relationship among variables in the model, and estimating its parameters, based on available data. In mathematics, a function y = f(x) is a rule of correspondence, often written as an equation, that associates with each value of x one and only one value of y. A well-known functional

* A random variable is a variable whose values are assumed to result from some random process (p. 1); these values are not known before observations are made. A random variable is not a variable consisting of numbers drawn at random; such variables, usually generated with the help of a pseudo-random number generator, are used by statisticians to assess the properties of statistical methods under some hypothesis.
