## Principal Components Analysis and Correspondence Analysis

One way of deriving a set of ordination axes that summarize the main patterns in ecological data is to rotate the original variable axes to a new configuration, using a regression-type approach to align the new ordination axes along major trends in the data. A common method for doing this is principal components analysis (PCA). This approach was implicit in the informal example above (Figure 1) and is described in further detail elsewhere (see Principal Components Analysis). PCA takes a sample × variable data table and finds a set of axes defining an ordination space in which the first axis explains the maximum possible variance, and each subsequent axis explains a progressively smaller fraction of the variance while remaining orthogonal to (i.e., independent of) the previously extracted axes. A plot of the samples on the first few axes generates a 'reduced space plot', which is the primary output of PCA.

As an example, triplefin (Pisces: Tripterygiidae) fish abundance was sampled at a range of sites with different exposure and location characteristics in northeastern New Zealand. The data were first χ²-transformed: the Euclidean distance of sample × variable data transformed by $y'_{ij} = y_{ij} / (y_{i+}\sqrt{y_{+j}})$, where $y_{ij}$ is the dependent variable value in the $i$th row and $j$th column, $y_{i+}$ is the row total, and $y_{+j}$ is the column total, gives the χ² distance between samples. This transformation was used to enable comparison with another ordination that will be introduced later - correspondence analysis (CA) - which implicitly preserves χ² distance. PCA of these data identified a gradient in triplefin assemblages from sheltered to exposed sites, with assemblages on offshore exposed and sheltered mainland sites distinct from each other and from the semiexposed and exposed mainland sites (Figure 2a).

To interpret the reduced space plot, the eigenvectors from the PCA were graphed. The eigenvectors represent a projection of the original species axes into the graph and thus show which species contribute to the reduced space. In this example, the pattern is driven by the relative numerical dominance of three species. Forsterygion varium was characteristic of mainland exposed/semiexposed sites, Notoclinops segmentatus was characteristic of exposed sites, regardless of mainland or offshore status, and Forsterygion lapillum was characteristic of sheltered sites (Figure 2d). Together these graphs form a 'biplot'.
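The workflow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the analysis behind Figure 2: the count table is invented, and `chi_square_transform` and `pca_covariance` are hypothetical helper names. The sketch applies the χ² transformation and then extracts principal components from the covariance matrix through a singular value decomposition.

```python
import numpy as np

def chi_square_transform(Y):
    """Return y'_ij = y_ij / (y_i+ * sqrt(y_+j)) for a sample x variable table."""
    Y = np.asarray(Y, dtype=float)
    row_totals = Y.sum(axis=1, keepdims=True)   # y_i+
    col_totals = Y.sum(axis=0, keepdims=True)   # y_+j
    return Y / (row_totals * np.sqrt(col_totals))

def pca_covariance(X):
    """PCA on the covariance matrix, computed by SVD of column-centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s**2 / (X.shape[0] - 1)       # variance along each axis
    eigenvectors = Vt.T                         # columns hold the variable loadings
    scores = Xc @ eigenvectors                  # sample positions in reduced space
    return scores, eigenvectors, eigenvalues

# Invented counts: 5 sites (rows) x 4 species (columns); NOT the triplefin data.
counts = np.array([[12, 3, 0, 1],
                   [8, 5, 1, 0],
                   [2, 9, 6, 0],
                   [0, 4, 10, 3],
                   [1, 0, 7, 9]])

scores, eigenvectors, eigenvalues = pca_covariance(chi_square_transform(counts))
explained = eigenvalues / eigenvalues.sum()     # fraction of variance per axis
print(np.round(explained, 3))
```

Plotting the first two columns of `scores` would give the reduced space plot, and drawing the first two columns of `eigenvectors` as arrows from the origin would give the species half of the biplot.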

[Figure 2: six-panel plot not reproduced. Recoverable axis labels: Principal component 1 (53.0%); Dimension 1 (46.9%); Correlation with axis 1.]

Figure 2 Reduced space ordination plots ((a) PCA, (b) CA, and (c) nonmetric multidimensional scaling (nm-MDS)) of triplefin fish assemblages at sites of varying exposure (circles, sheltered mainland; diamonds, semiexposed mainland; squares, exposed mainland; inverted triangles, exposed offshore). Data were transformed by $y'_{ij} = y_{ij} / (y_{i+}\sqrt{y_{+j}})$ for PCA and nm-MDS to preserve the χ² distance and enable comparison with CA, which implicitly preserves weighted χ² distance. PCA was calculated on the covariance matrix of the transformed data, and a two-dimensional nm-MDS calculated on the Euclidean distance of the transformed data. Interpretation of the PCA is provided by a plot of the eigenvectors (d), and indicates that the relative dominance of F. lapillum, F. varium, and N. segmentatus characterized the difference between sites. Together, Figures 2a and 2d form a Euclidean biplot. Interpretation of CA is provided by the projection of the species profiles onto the ordination space (e). In contrast with PCA eigenvectors, the species are displayed as points, which represent the 'center' of their distributions. Forsterygion lapillum and R. decemdigitatus typify sheltered sites, whereas N. caerulepunctus and K. stewarti typify exposed offshore sites. Together, Figures 2b and 2e form a joint plot. While there is no formal way of presenting species information in MDS, if the axes have been rotated by PCA the correlation between the variables and the positions of sites in ordination space can be displayed as a post hoc interpretation (f). This approach is similar to a factor loading plot (see Principal Components Analysis). All reduced space plots are very similar, reinforcing the point that the behavior of an ordination is primarily a function of the transformation and distance preserved, not the analysis method itself.

PCA implicitly reflects linear responses of variables to gradients. In one sense, it rotates the original set of orthogonal (i.e., at right angles, and hence independent of each other) variable axes so that the first axis lies along the main axis of the data cloud, with subsequent axes progressively 'dividing' the data cloud into orthogonal trends. As a consequence, each ordination axis maximally separates samples along its length. This implicitly suggests that the patterns identified by PCA are linear patterns - a sample that lies further away from the origin in the same direction as a variable eigenvector will have a higher value of that variable. This linearity assumption is also reflected in the interpretation of the dependent variable eigenvectors. The position of each sample on an ordination axis is the sum of its (centered) variable values, each multiplied by the corresponding element of that axis's eigenvector.

It is important to note that PCA (as with most ordinations) is sensitive to data scaling. A PCA can be carried out on either a covariance or a correlation matrix. If run on a covariance matrix, the variables with larger variances will contribute more to the analysis, whereas a correlation matrix (equal to the covariance matrix of standardized data) will enable variables with different means and standard deviations to contribute equally to the analysis (see Principal Components Analysis).
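The effect of scaling can be illustrated with a small sketch. The two variables below are simulated and deliberately placed on very different scales, and `pca_from_matrix` is a hypothetical helper rather than a library function. With a covariance matrix the first eigenvector is dominated by the large-variance variable; with a correlation matrix both variables contribute equally.

```python
import numpy as np

def pca_from_matrix(S):
    """Eigen-decomposition of a symmetric matrix, sorted by decreasing eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Simulated data (an assumption for illustration): two correlated variables,
# the second measured on a scale roughly 100 times larger than the first.
rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, 50)
large = 100.0 * (0.5 * small + rng.normal(0.0, 1.0, 50))
X = np.column_stack([small, large])
Xc = X - X.mean(axis=0)

cov_values, cov_vectors = pca_from_matrix(np.cov(X, rowvar=False))
cor_values, cor_vectors = pca_from_matrix(np.corrcoef(X, rowvar=False))

# Sample positions: centered values multiplied by the eigenvectors.
scores = Xc @ cov_vectors

# The correlation matrix is the covariance matrix of standardized data.
Z = Xc / X.std(axis=0, ddof=1)

print("covariance PCA, axis 1 loadings: ", np.round(cov_vectors[:, 0], 3))
print("correlation PCA, axis 1 loadings:", np.round(cor_vectors[:, 0], 3))
```

The sign of an eigenvector is arbitrary, so only the absolute sizes of the loadings should be compared between the two analyses.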

While PCA maximizes the dispersion between samples (and hence the variance explained), this is not the only criterion that can be used to generate an ordination space. An alternative approach is to develop an ordination that maximizes the joint correlation (or 'correspondence') between the sample positions in a reduced space and the positions of the variable scores (e.g., species) in the same space. This approach is used in CA, and differs in several respects from PCA. CA takes a sample × variable data table and finds a set of axes that best describes the lack of independence between the rows and columns of the table. As with PCA, the first axis describes the main pattern of nonindependence, with progressively smaller fractions of variance explained on subsequent orthogonal axes (i.e., axes independent of those previously extracted). The most notable difference is that CA does not assume linear changes in species abundance along the ordination axes; rather, it models the center of the distribution of a species across samples. In doing so, it generates an ordination that can represent unimodal gradients in species abundance.

CA was developed for contingency tables, in which the data are counts of occurrences of sample × variable combinations. Such data typically consist of integers or zeroes rather than the truly continuous values of PCA data tables, and so the Euclidean distance preserved in PCA may not be appropriate. A more ecologically appropriate way of viewing such data is to compare the profiles of counts between samples (weighted by their abundance), rather than their Euclidean distance apart. The χ² distance achieves this, and is the multivariate distance implicitly preserved in CA.
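The contrast between comparing profiles and comparing raw counts can be made concrete. The sketch below uses an invented three-sample table in which the first two samples have identical relative composition but different total abundance; the function names are hypothetical. The χ² distance is written here without the constant factor (the square root of the grand total) that some definitions include, which matches the transformation $y'_{ij} = y_{ij} / (y_{i+}\sqrt{y_{+j}})$.

```python
import numpy as np

def chi_square_distance(Y):
    """Pairwise chi-square distances between the rows (samples) of a count table."""
    Y = np.asarray(Y, dtype=float)
    profiles = Y / Y.sum(axis=1, keepdims=True)       # row profiles y_ij / y_i+
    col_totals = Y.sum(axis=0)                        # y_+j
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff**2 / col_totals).sum(axis=2))

def euclidean_distance(Y):
    """Pairwise Euclidean distances between the rows of a table."""
    Y = np.asarray(Y, dtype=float)
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff**2).sum(axis=2))

# Invented counts: samples 0 and 1 share the same profile, sample 2 differs.
counts = np.array([[10, 5, 1],
                   [20, 10, 2],
                   [1, 5, 10]])

d_chi = chi_square_distance(counts)
d_euc = euclidean_distance(counts)

# Euclidean distance on the transformed table reproduces the chi-square distance.
transformed = counts / (counts.sum(axis=1, keepdims=True) * np.sqrt(counts.sum(axis=0)))
d_transformed = euclidean_distance(transformed)

print(np.round(d_euc, 3))
print(np.round(d_chi, 3))
```

Samples 0 and 1 are far apart in Euclidean terms because one has twice the abundance of the other, yet their χ² distance is zero because their profiles are identical.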

CA takes a sample × variable table and calculates the position of each sample that maximizes the dispersion (or variability) of each variable score, under the proviso that the variable score is itself a weighted average of the sample scores. In this way, samples and variables are treated as relatively symmetric entities, in contrast with the PCA approach of treating variables as linear predictors of the position of a sample in multivariate space. As with PCA, the primary output of interest is a plot of the samples and the variables in a reduced space ordination. However, the interpretation of the plot is rather different. In a PCA Euclidean biplot the distance between samples approximates their Euclidean distance apart, whereas in a CA joint plot the distance between samples approximates their χ² distance apart. In PCA, if samples lie in the same direction from the origin as a variable eigenvector, then they are assumed to have a linear increase in the value of that variable. The variable values in a CA joint plot are the centroids of the species values, not the direction of their greatest increase. If a sample is close to a variable on the plot, then it should have a high value of that variable. It should be noted that these interpretations may depend on the scaling used to generate the joint plot. It is possible to rescale the plot to display the sample centroids and the distances between species, if that is the aim of the analysis.
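A compact way to see the weighted-average property is to compute CA through a singular value decomposition of the standardized residuals from independence. This is one common formulation, offered here as a sketch rather than as the procedure used for Figure 2; the count table is invented and `correspondence_analysis` is a hypothetical function name. In the scaling shown, each species score is the abundance-weighted average of the (standardized) sample scores.

```python
import numpy as np

def correspondence_analysis(N):
    """CA via SVD of standardized residuals; returns scores for the nontrivial axes."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                                   # correspondence matrix
    r = P.sum(axis=1)                                 # row (sample) masses
    c = P.sum(axis=0)                                 # column (variable) masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                              # drop the trivial zero axis
    U, s, V = U[:, :k], s[:k], Vt[:k].T
    row_standard = U / np.sqrt(r)[:, None]
    return {
        "eigenvalues": s**2,                          # inertia per axis
        "row_standard": row_standard,
        "row_principal": row_standard * s,            # distances approximate chi-square
        "col_principal": (V / np.sqrt(c)[:, None]) * s,
    }

# Invented counts: 5 sites (rows) x 4 species (columns).
counts = np.array([[12, 3, 0, 1],
                   [8, 5, 1, 0],
                   [2, 9, 6, 0],
                   [0, 4, 10, 3],
                   [1, 0, 7, 9]])
ca = correspondence_analysis(counts)

# Species scores as weighted averages (centroids) of the sample scores.
P = counts / counts.sum()
column_profiles = P.T / P.sum(axis=0)[:, None]
centroids = column_profiles @ ca["row_standard"]

print(np.round(ca["eigenvalues"] / ca["eigenvalues"].sum(), 3))
```

Swapping the roles of rows and columns gives the alternative scaling in which samples sit at the centroids of the species, which is the rescaling option mentioned above.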

A CA of the untransformed triplefin data set, as with the PCA of the χ²-transformed data, identifies a gradient in triplefin assemblages from sheltered to exposed sites, with assemblages on offshore exposed and sheltered mainland sites distinct from the semiexposed and exposed mainland sites (Figure 2b). This similarity with PCA is due to the χ² transformation used in the PCA, although the graphs are not exactly the same because of the weighting implicit in CA. The plot of the species, however, is different from that of the PCA. While PCA identified the pattern as being due to changes in a three-species dominance, CA identified that some of these patterns were due to differences in abundance of relatively uncommon species such as Notoclinops caerulepunctus and Karalepis stewarti at exposed offshore sites, Ruanoho decemdigitatus at sheltered sites, and Grahamina nigripenne at mainland sites (Figure 2e). CA can be sensitive to the presence of rare species and multivariate outliers.

## How Many Axes to Plot?

Both PCA and CA generate as many ordination axes as there are variables in the data set. However, only a few axes are likely to reflect trends or patterns in the data. Deciding how many axes to graph is based on a plot of the eigenvalues of each axis by its rank order. This is termed a scree plot. The eigenvalue in PCA reflects the amount of variability in the data set explained by an ordination axis, whereas the eigenvalue in CA reflects the amount of non-independence (i.e., correlation) between sample and variable scores in the reduced space. Regardless of the method, examining the scree plot and the difference between variance explained/row-column correlation with subsequent axes will usually enable a researcher to identify which axes represent patterns (indicated by sharp differences in eigenvalues of axes in decreasing order) versus noise (little difference with subsequent axes) (see Principal Components Analysis).
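The numbers behind a scree plot are simple to compute. The sketch below simulates a data set with two strong gradients plus noise (an assumption made purely for illustration) and tabulates the eigenvalues by rank. Choosing the number of axes at the largest drop between successive eigenvalues is only one possible heuristic, used here to make the 'elbow' explicit; in practice the scree plot is usually judged by eye.

```python
import numpy as np

# Simulated data: two latent gradients drive six variables, plus small noise.
rng = np.random.default_rng(1)
n_samples = 200
gradients = rng.normal(size=(n_samples, 2))
loadings = np.array([[2.0, 2.0, 2.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]])
X = gradients @ loadings + rng.normal(scale=0.3, size=(n_samples, 6))

# Eigenvalues of the covariance matrix in decreasing rank order.
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

for rank, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"axis {rank}: {p:6.1%} (cumulative {c:6.1%})")

# Heuristic (an assumption, not a formal rule): keep the axes that precede
# the largest drop between successive eigenvalues.
drops = -np.diff(eigenvalues)
n_axes = int(np.argmax(drops)) + 1
print("axes retained:", n_axes)
```

Plotting `eigenvalues` against rank gives the scree plot itself; the same tabulation applies to the eigenvalues of a CA.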

## Arches, Horseshoes, and Detrending

Ordinations using PCA and CA may be subject to an effect known as an arch, in which a plot of samples on axis 1 versus axis 2 of the ordination space displays a curvilinear pattern rather than a linear gradient along the first axis. In its extreme form, a 'horseshoe', the ends of the curve bend inward so that samples that are very dissimilar lie closer to each other than should be expected. This effect (also known as the Guttman effect) is a result of the limitations of the different distance measures preserved by PCA and CA when samples have been taken across long environmental gradients and the composition of the community undergoes large changes. The Euclidean distance preserved by PCA does not treat double-zeroes (joint absences) specially, so samples that share no species in common can appear more similar to each other than they are, thus generating an arch in the ordination. In CA, an additional effect is introduced because a second ordination axis, independent of the first, can be generated by folding the first axis in half, thus placing distinct samples in close proximity to each other and generating a horseshoe. The easiest solution in these cases is simply to ignore the second axis and plot the first versus the third axis, accepting that nonlinearities occur in nature and that the first two axes are collectively measuring a pattern that cannot be represented on a single axis in Euclidean space.
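The mechanism can be reproduced with an idealized gradient. In the sketch below (an invented presence/absence table, not field data), each site holds a window of five species and the window shifts by one species per site. Once two sites share no species, their Euclidean distance stops growing, however far apart they lie on the gradient. A PCA of such a table would usually be expected to show the second axis as a curved function of the first; the quadratic fit at the end is one informal way to gauge that curvature.

```python
import numpy as np

# Idealized long gradient: site i contains species i .. i + width - 1.
n_sites, width = 30, 5
n_species = n_sites + width - 1
Y = np.zeros((n_sites, n_species))
for i in range(n_sites):
    Y[i, i:i + width] = 1.0

def euclid(a, b):
    """Euclidean distance between two sample vectors."""
    return float(np.sqrt(((a - b) ** 2).sum()))

d_near = euclid(Y[0], Y[2])     # neighbors that still share species
d_mid = euclid(Y[0], Y[10])     # no shared species
d_far = euclid(Y[0], Y[29])     # opposite ends of the gradient, no shared species
print(d_near, d_mid, d_far)

# PCA on the covariance matrix via SVD of the centered table.
Yc = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
axis1, axis2 = (U * s)[:, 0], (U * s)[:, 1]

# Informal check for an arch: how well does a quadratic in axis 1 predict axis 2?
coeffs = np.polyfit(axis1, axis2, 2)
residual = axis2 - np.polyval(coeffs, axis1)
r2 = 1.0 - (residual**2).sum() / ((axis2 - axis2.mean()) ** 2).sum()
print("quadratic R^2 of axis 2 on axis 1:", round(r2, 3))
```

Because the distance between the first site and every site beyond the window is identical, the ordination cannot place those sites progressively further away along a single straight axis.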

In the CA literature, a common approach is to detrend the ordination to recover a linear gradient. Detrended CA proceeds by calculating a CA, then calculating a running average of the second axis to flatten the arch. There are two strategies in common use: detrending by segments, in which the first axis is divided into blocks and the average of the second axis within each block is subtracted from the second axis to flatten the data; and detrending by polynomials, in which a polynomial is fitted to the relationship between axis 2 and the values of axis 1, and the residuals of axis 2 are plotted in place of its original values. Detrending alters the interpretation of the joint plot - the distances between points are no longer χ² distances.

While detrending is popular in some ecological subdisciplines, the need for it depends largely on whether the arch effect is viewed as a problem. If the community is believed to respond to a linear gradient, in a somewhat nonlinear way, then detrending is a reasonable option. If a simple description of the community is desired, but the primary pattern in the data is multidimensional and hence cannot be represented on just the first axis, then there is no need to detrend. The second axis can be ignored, and a plot of the first and third axes used to describe the data structure.
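The two detrending strategies can be sketched as post hoc operations on a pair of axis scores. This is a simplified illustration of the description above, not the full detrended CA algorithm as implemented in ordination software; the axis scores are synthetic (a quadratic arch plus noise), and the function names and the choice of five segments are assumptions.

```python
import numpy as np

def detrend_by_polynomial(axis1, axis2, degree=2):
    """Replace axis 2 by its residuals from a polynomial fitted on axis 1."""
    coeffs = np.polyfit(axis1, axis2, degree)
    return axis2 - np.polyval(coeffs, axis1)

def detrend_by_segments(axis1, axis2, n_segments=5):
    """Subtract the mean of axis 2 within equal-width blocks of axis 1."""
    edges = np.linspace(axis1.min(), axis1.max(), n_segments + 1)
    segment = np.digitize(axis1, edges[1:-1])     # block index 0 .. n_segments - 1
    detrended = np.array(axis2, dtype=float)
    for k in np.unique(segment):
        in_segment = segment == k
        detrended[in_segment] -= axis2[in_segment].mean()
    return detrended

# Synthetic arch: axis 2 is a quadratic function of axis 1 plus small noise.
rng = np.random.default_rng(2)
axis1 = np.linspace(-1.0, 1.0, 41)
axis2 = 0.8 * (axis1**2 - np.mean(axis1**2)) + rng.normal(scale=0.02, size=axis1.size)

poly_resid = detrend_by_polynomial(axis1, axis2)
seg_resid = detrend_by_segments(axis1, axis2)

print("variance of axis 2, original:  ", round(axis2.var(), 5))
print("variance after polynomial fit: ", round(poly_resid.var(), 5))
print("variance after segment means:  ", round(seg_resid.var(), 5))
```

Both approaches remove most of the systematic curvature; the segment method leaves some within-block trend behind, which shrinks as the number of segments increases.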