PCA was originally developed to describe patterns in multivariate normal (MVN) data. However, departures from MVN in general are less critical to the success of the analysis than skewness in the data. PCA should always be preceded by an examination of bivariate plots of the dependent variables to check for strong skewness. As with univariate analyses, mathematical transformation may reduce skewness to an acceptable level. However, transformation (and standardization) also changes the importance of variables in defining the ordination space. Analyses of untransformed data are usually strongly influenced by the variables with the largest values, and hence the largest variances. As PCA is usually an exploratory technique, conformity to distributional assumptions for inferential reasons is of minor importance, but attention should be paid to the numerical scale of the data.
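The influence of numerical scale can be illustrated with a minimal sketch. The data here are simulated and purely hypothetical (three independent variables whose standard deviations differ by factors of ten), and the helper `pca_eigen` is an illustrative name, not part of any library. The sketch compares an eigenanalysis of the covariance matrix (unstandardized data) with one of the correlation matrix (standardized data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 3 independent variables on very different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

def pca_eigen(matrix):
    """Return eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(matrix)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# PCA of unstandardized data: eigenanalysis of the covariance matrix
cov_vals, cov_vecs = pca_eigen(np.cov(X, rowvar=False))

# PCA of standardized data: eigenanalysis of the correlation matrix
cor_vals, cor_vecs = pca_eigen(np.corrcoef(X, rowvar=False))

# Proportion of total variance carried by each PC
cov_share = cov_vals / cov_vals.sum()
cor_share = cor_vals / cor_vals.sum()

print("covariance PCA, variance share:", np.round(cov_share, 3))
print("covariance PCA, PC1 coefficients:", np.round(cov_vecs[:, 0], 3))
print("correlation PCA, variance share:", np.round(cor_share, 3))
```

In the covariance analysis the first PC is almost entirely the large-scale variable, whereas standardization gives each variable equal weight.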
PCA assumes that covariances (or correlations) are good descriptors of patterns in data. This implies that the relationships between dependent variables are linear, or at least monotonic. If organisms respond unimodally to gradients, then PCA might not identify the pattern. In this case an alternative ordination such as 'correspondence analysis' should be considered. PCA also works best when most variables have nonzero values along most of the sample gradient, and the main differences between samples are changes in relative magnitude. In an ecological community context, this would require that most species were found in most of the samples. It is not uncommon to drop rare species (e.g., those that occur in <5-10% of the samples) from the analysis if the zero values are simply a product of sampling variability and low species abundance given sampling effort. However, double zeroes, which arise when the set of species found at one site is entirely absent from another and, conversely, the set of species found at the second site is entirely absent from the first, can generate problems with PCA. Complete changes in species along sample gradients may generate an arch effect in the ordination. An extreme form of this arch is called a horseshoe, and results in samples that are at opposite ends of the ecological gradient appearing closer to each other on the ordination. The Euclidean distance preserved in PCA will underestimate the difference between sites that have few shared species. This problem can be dealt with to a certain extent by choice of transformation. For example, the Euclidean distance computed after the transformation $y'_{ij} = y_{ij} / \sqrt{\sum_{j=1}^{p} y_{ij}^{2}}$ (where $y_{ij}$ is the abundance of species $j$ in sample $i$ and $p$ is the number of species) preserves the chord distance, which is at its maximum when two samples have no species in common. If transformation cannot preserve the ecologically important properties of the data, then other ordination methods, such as metric or nonmetric multidimensional scaling of alternative distance measures such as the Bray-Curtis index, can be used.
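The chord transformation can be sketched as follows. The abundance table is a small hypothetical example (rows are samples, columns are species), and `chord_transform` and `euclid` are illustrative helper names. Each row is divided by its Euclidean norm, so that the Euclidean distance between two transformed samples equals their chord distance.

```python
import numpy as np

def chord_transform(Y):
    """Divide each sample (row) by its Euclidean norm.

    Assumes no sample is entirely zero; such rows would need to be
    dropped or handled separately before calling this function.
    """
    Y = np.asarray(Y, dtype=float)
    norms = np.sqrt((Y ** 2).sum(axis=1, keepdims=True))
    return Y / norms

def euclid(a, b):
    """Euclidean distance between two vectors."""
    return np.sqrt(((a - b) ** 2).sum())

# Hypothetical abundances: samples 0 and 1 share no species,
# samples 0 and 2 have similar composition
Y = np.array([[10, 5, 0, 0],
              [ 0, 0, 3, 8],
              [ 9, 6, 1, 0]])

Yc = chord_transform(Y)

d_disjoint = euclid(Yc[0], Yc[1])  # samples with no species in common
d_similar = euclid(Yc[0], Yc[2])   # samples with similar composition

print("chord distance, no shared species:", d_disjoint)
print("chord distance, similar samples:  ", d_similar)
```

For samples with no species in common the transformed rows are orthogonal unit vectors, so the distance reaches its maximum of $\sqrt{2}$ regardless of total abundance.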
As with most statistical methods that require robust estimation of covariances, such as multiple regression, sample size is an important analytical consideration. PCA generates as many PC axes as there are variables unless there are fewer samples in the data set than there are variables. Most software packages will still calculate the analysis under this condition, with the restriction that the number of PCs is limited by the number of samples rather than the number of variables (strictly, to one fewer than the number of samples once the data are centered). In general, the first PC axes will still be interpretable but minor axes may not be because of overfitting of the model. This is analogous to multiple regression analysis, in which fitting too many variables with too few data points will yield a degenerate solution. Several guidelines have been suggested for establishing appropriate sample sizes to generate robust PCA solutions. Some researchers have suggested that studies should aim to achieve absolute sample sizes ranging from 50 (very poor) through 200 (fair) and 300 (good) up to 1000 or more (excellent). Others have suggested that the ratio of samples to variables is of more importance, with minimum values of 5:1-10:1. It is important to note that these recommendations have generally been made by users of 'factor analysis', in which robust covariance estimates are key to identifying stable analytical solutions. In addition, many of these suggestions stem from the social sciences, in which raw data, such as questionnaire responses, are often 'indicators' of variables rather than direct measures of the variables themselves. In most ecological applications these sample sizes are unrealistic, and the focus of the analysis is on description of an assemblage rather than the 'factor analysis' objective of recovering underlying causal factors. For most purposes, a rule of thumb for PCA would be to ensure that there are more replicates than variables. In general, the greater the ratio of replicates to variables, the better.
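The limit on the number of axes when samples are fewer than variables can be checked numerically. This is a minimal sketch on simulated, hypothetical data (6 samples, 10 variables). Because the covariance matrix is computed from centered data, at most one fewer than the number of samples eigenvalues can be nonzero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set with fewer samples than variables
n_samples, n_vars = 6, 10
X = rng.normal(size=(n_samples, n_vars))

# Covariance matrix of the variables (10 x 10); np.cov centers the data
S = np.cov(X, rowvar=False)

# Eigenvalues in descending order
eigvals = np.linalg.eigvalsh(S)[::-1]

# Count eigenvalues that are nonzero relative to the largest one
n_nonzero = int((eigvals > 1e-10 * eigvals[0]).sum())

print("variables:", n_vars, "samples:", n_samples)
print("nonzero eigenvalues:", n_nonzero)
```

The remaining eigenvalues are zero to within floating-point error, so the corresponding axes carry no information.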
PCA is a data exploration and display tool, not a hypothesis-testing method subject to strict distributional assumptions, so it should yield useful insight into data set structure even if the replication is not as great as desired. Robustness of the PCA solution can always be evaluated by bootstrapping, as described above for eigenvalues and eigenvectors.
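One way such a robustness check might look is sketched below. This is a generic percentile bootstrap of the proportion of variance explained by each PC, not necessarily the exact procedure referred to above. The data are simulated and hypothetical, and the number of resamples (500) and the helper name `prop_explained` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 40 samples, 4 variables, three of which share one gradient
n, p = 40, 4
latent = rng.normal(size=(n, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.0]]) + 0.5 * rng.normal(size=(n, p))

def prop_explained(data):
    """Proportion of variance per PC from a correlation-matrix PCA."""
    vals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    return vals / vals.sum()

observed = prop_explained(X)

# Resample rows (samples) with replacement and repeat the eigenanalysis
n_boot = 500
boot = np.empty((n_boot, p))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b] = prop_explained(X[idx])

# Percentile intervals for the variance share of each PC
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)

for k in range(p):
    print(f"PC{k + 1}: observed {observed[k]:.3f}, "
          f"95% interval [{lower[k]:.3f}, {upper[k]:.3f}]")
```

Wide or overlapping intervals for adjacent PCs suggest that those axes are not well separated. Bootstrapping eigenvectors additionally requires handling arbitrary sign flips and axis reordering between resamples, which this sketch does not attempt.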
It is important to be aware of software idiosyncrasies when calculating PCA. Many software implementations calculate PCA on the correlation matrix by default. In addition, PCA is mathematically related to 'factor analysis' (FA), a method used widely in the social sciences. The main conceptual difference between PCA and FA is that PCA considers the ordination axes to be a product of the variables: each axis is a linear combination of the variables, with coefficients given by an eigenvector. FA considers the variables to be a product of the axes. In this interpretation, the variable values are 'caused' by the hypothetical ordination axes, rather than the axes simply reflecting patterns in the data. The mathematical similarity of the two approaches has led many software packages to combine PCA routines with FA routines. Implementations that incorporate FA and PCA may, by default, yield a covariance biplot of the correlation matrix, with the sample scores scaled to 1 and the eigenvectors scaled to their standard deviation (the factor loading or pattern). FA uses the correlation matrix by default, so, as outlined above, an analysis of the covariance matrix may substantially change the interpretation of the resulting biplot. FA software also offers the option of axis rotations. Rotations are intended to align axes so that factor loadings are maximized, that is, so that each variable is associated with a single axis. These procedures should not be used for PCA unless the intent is to use PCA in an exploratory FA and not as a descriptor of ecological data. As with all ecological data analysis, it is important to ensure that the correct technique is being employed.
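The scaling of eigenvectors to loadings can be made concrete with a short sketch. It assumes a correlation-matrix PCA on simulated, hypothetical data and uses the common convention that a loading is an eigenvector coefficient multiplied by the square root of its eigenvalue. Software packages differ in their scaling conventions, so output should be checked against the package documentation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 60 samples, 3 variables, two of them correlated
X = rng.normal(size=(60, 3))
X[:, 1] += 0.7 * X[:, 0]

# Standardize, then eigenanalysis of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Sample scores: linear combinations of the standardized variables
scores = Z @ vecs

# Loadings: eigenvectors scaled by the standard deviation of each PC
loadings = vecs * np.sqrt(vals)

# Direct check: correlation of every variable with every PC
direct = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                    for k in range(3)] for j in range(3)])

print("loadings:\n", np.round(loadings, 3))
print("variable-PC correlations:\n", np.round(direct, 3))
```

Under this convention the loadings equal the correlations between the original variables and the PC scores, which is why they are read as the 'pattern' in FA-style output.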
PCA is a very flexible procedure. In its basic form, it is an eigenanalysis of a covariance or correlation matrix. Consequently, it is possible to calculate PCA on a nonparametric correlation matrix such as Spearman's rank correlation. This approach can be useful for dealing with nonlinear (but monotonic) relationships between variables. It follows that PCA can also be calculated on a partial correlation or covariance matrix. A partial correlation coefficient is one that has been statistically adjusted for another variable, essentially correcting for a covariate. Ecological applications of partial PCA are rare, but morphometric studies frequently use partial PCA to assess relationships between morphometric variables after correcting for the size of the organism. The ecological parallel is clear: dominant variables such as wave exposure or moisture gradients could be effectively removed from the analysis prior to PCA, to yield an ordination that statistically 'corrects' for them.
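Both variants can be sketched with NumPy alone. The data are simulated and hypothetical: three variables driven by a single covariate called `size` here, one of them nonlinearly. The rank-based helper assumes no tied values, and the partial correlation matrix is obtained by the standard route of correlating residuals after regressing each variable on the covariate.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80

# Hypothetical covariate (e.g., body size) driving all three variables
size = rng.normal(size=n)
X = np.column_stack([
    np.exp(size + 0.3 * rng.normal(size=n)),  # monotonic but nonlinear
    size + 0.3 * rng.normal(size=n),
    2.0 * size + 0.3 * rng.normal(size=n),
])

def rank_columns(A):
    """Rank each column (0 = smallest); valid when there are no ties."""
    return A.argsort(axis=0).argsort(axis=0).astype(float)

def variance_share(R):
    """Proportion of variance per PC from a correlation matrix."""
    vals = np.linalg.eigvalsh(R)[::-1]
    return vals / vals.sum()

# PCA on Spearman's rank correlation: Pearson correlation of the ranks
R_spearman = np.corrcoef(rank_columns(X), rowvar=False)
share_spearman = variance_share(R_spearman)

# Partial PCA: regress each variable on the covariate (with intercept),
# then analyze the correlation matrix of the residuals
D = np.column_stack([np.ones(n), size])
coef, *_ = np.linalg.lstsq(D, X, rcond=None)
resid = X - D @ coef
R_partial = np.corrcoef(resid, rowvar=False)
share_partial = variance_share(R_partial)

print("Spearman PCA variance share:", np.round(share_spearman, 3))
print("partial PCA variance share: ", np.round(share_partial, 3))
```

In this simulated example the first PC dominates the rank-based analysis because every variable follows the covariate. Once the covariate is partialled out, little shared structure remains.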
Although PCA was developed as an ordination method to summarize patterns in multivariate data, it also has a range of other uses. PCA can be used in multiple regression to detect collinearity of predictor variables, and as a variable-reduction tool. For example, collinear variables could be replaced in a multiple regression with their first PC. Reducing the number of regression coefficients in this way, and improving their independence, will increase the power and stability of the regression. PCs can themselves be used in data presentations and analyses. For example, a contour plot of PCs of spatially structured data could provide information on a range of variables in a single graph.
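A minimal sketch of both uses follows, on simulated, hypothetical data in which two predictors (`x1`, `x2`) are nearly collinear and a third (`x3`) is independent. Collinearity is flagged by a near-zero eigenvalue of the predictor correlation matrix, summarized here by the square root of the ratio of largest to smallest eigenvalue, a common condition index. The collinear pair is then replaced by its first PC in an ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

# Hypothetical predictors: x1 and x2 nearly collinear, x3 independent
base = rng.normal(size=n)
x1 = base + 0.05 * rng.normal(size=n)
x2 = base + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = base + 0.5 * x3 + 0.3 * rng.normal(size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def condition_index(R):
    """sqrt(largest / smallest eigenvalue) of a correlation matrix."""
    vals = np.linalg.eigvalsh(R)
    return np.sqrt(vals[-1] / vals[0])

# Detect collinearity among all three predictors
vals_full = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
cond_full = condition_index(np.corrcoef(Z, rowvar=False))

# Replace the collinear pair with its first PC
_, vecs_pair = np.linalg.eigh(np.corrcoef(Z[:, :2], rowvar=False))
pc1 = Z[:, :2] @ vecs_pair[:, -1]  # eigh sorts ascending; last = largest

reduced = np.column_stack([pc1, Z[:, 2]])
cond_reduced = condition_index(np.corrcoef(reduced, rowvar=False))

# Ordinary least squares on the reduced predictor set
D = np.column_stack([np.ones(n), reduced])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
fitted = D @ coef
r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("eigenvalues, full predictor set:", np.round(vals_full, 4))
print("condition index, full vs reduced:", cond_full, cond_reduced)
print("R^2 of reduced model:", r2)
```

The reduced model estimates one fewer coefficient, and its predictors are close to uncorrelated. The cost is that the coefficient on the PC describes the combined gradient rather than either original variable.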