## An Informal Explanation

Given a multivariate data set consisting of a number of samples in which many variables have been recorded, PCA can be understood intuitively as a process in which the original variable axes are realigned along lines of variation in the data and the values for each sample on those new axes are calculated. For example, consider a data set containing ten samples of abundances of two species. The relative position of each sample in two dimensions can be displayed with a scatterplot of species A versus species B (Figure 1a). A new set of ordination axes can be generated by moving the old axes to the center of the data set, which is done by subtracting the mean of each variable from the sample values (a process known as centering), and then rotating these axes so that they lie along the major lines of variation (Figure 1b). In PCA, the first axis (principal component 1) lies along the greatest line of variation, the second axis lies along the next greatest line of variation on the condition that it lies at right angles to the first, and so on for subsequent axes. This guarantees a property known as orthogonality, which means that the axes are uncorrelated with one another. The next step is to project the sample positions onto these new axes, which are called principal components (PCs) (Figure 1c).
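The centering, rotation, and projection steps described above can be sketched for the two-variable case in a few lines of Python. This is a minimal illustration rather than a general PCA implementation: the abundance values are hypothetical (they are not the data behind Figure 1), the names `center` and `pca_2d` are invented for the example, and the rotation angle is obtained from the closed-form expression for a 2 x 2 covariance matrix, which only applies when there are exactly two variables.

```python
import math

# Hypothetical abundances for ten samples of two species (illustrative values only).
species_a = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
species_b = [1.5, 3.5, 3.0, 5.5, 5.0, 7.5, 7.0, 9.5, 9.0, 11.5]


def center(values):
    """Subtract the mean so that the axis origin sits at the center of the data."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pca_2d(x, y):
    """Return the rotation angle and the sample scores on PC1 and PC2."""
    xc, yc = center(x), center(y)
    n = len(xc)
    # Sample variances and covariance of the centered variables.
    sxx = sum(a * a for a in xc) / (n - 1)
    syy = sum(b * b for b in yc) / (n - 1)
    sxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)
    # Angle of the greatest line of variation (valid for two variables only).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Project each centered sample onto the rotated axes.
    pc1 = [a * cos_t + b * sin_t for a, b in zip(xc, yc)]
    pc2 = [-a * sin_t + b * cos_t for a, b in zip(xc, yc)]
    return theta, pc1, pc2


theta, pc1, pc2 = pca_2d(species_a, species_b)
print(f"rotation angle: {math.degrees(theta):.1f} degrees")
for score_1, score_2 in zip(pc1, pc2):
    print(f"{score_1:8.3f} {score_2:8.3f}")
```

Plotting `pc1` against `pc2` would give a reduced space plot of the kind shown in Figure 1c. With more than two variables, the rotation is normally found from an eigendecomposition or singular value decomposition of the centered data rather than from a single angle.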

In this two-species example, it is possible to display all the variation in a two-dimensional (2D) scatterplot of the original variables. However, if the aim were to explain as much variation as possible in only one dimension (i.e., a line), then the PCs have an important advantage over the original variable values. In this example the two species explain similar amounts of variation because their variances are approximately the same. At best, about 50% of the total variation in the data could be displayed in one dimension by plotting values for a single species. In contrast, a plot of the first PC in this example would explain 85.5% of the total variation in the data set. PCA partitions the variation so that the first PC explains more variation than any single variable, provided there is some correlation between variables. The importance of this becomes apparent when there are more than two species in a sample: more variation can be presented in a plot of two PCs than by plotting any pair of species.

Figure 1 Deriving principal component (PC) axes of a two-species data set. (a) The original data points can be displayed as a scatterplot of the two species. (b) A new set of axes (PCs) can be derived by placing axes at the center of the data mass, rotating the first axis along the main line of variation in the data set, and rotating the second axis along the next line of variation, conditional on independence from the first axis. (c) The position of the data points on the PCs can be plotted as a reduced space plot. (d) The direction of the original centered species axes can also be projected into the space to generate a biplot.

The PCs can also be interpreted in terms of the original species abundances. A projection of the original centered species axes into the reduced space plot can be used to derive a measure of association of each species with a PC axis (Figure 1d). In this example, samples that lie to the left of the first PC axis had low abundances of both species A and species B, whereas samples that lie to the right of the axis had high abundances of both species.
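The comparison between the variation captured by a single species and by the first PC can be checked numerically. The sketch below again assumes hypothetical abundance values (so it will not reproduce the 85.5% of Figure 1) and uses an invented helper, `variance_summary`, that relies on the closed-form eigenvalues of a 2 x 2 covariance matrix. It reports the share of total variance carried by each species and by each PC, together with the direction of the first PC in terms of the two species axes.

```python
import math

# Hypothetical abundances for ten samples of two species (illustrative values only).
species_a = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
species_b = [1.5, 3.5, 3.0, 5.5, 5.0, 7.5, 7.0, 9.5, 9.0, 11.5]


def variance_summary(x, y):
    """Compare variance shares of the raw variables with those of the PCs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    total = sxx + syy
    # Eigenvalues of the 2 x 2 covariance matrix are the variances along PC1 and PC2.
    root = math.hypot((sxx - syy) / 2.0, sxy)
    lambda_1 = total / 2.0 + root
    lambda_2 = total / 2.0 - root
    # Direction of PC1 expressed as weights on species A and species B.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return {
        "species_share": (sxx / total, syy / total),
        "pc_share": (lambda_1 / total, lambda_2 / total),
        "pc1_weights": (math.cos(theta), math.sin(theta)),
    }


summary = variance_summary(species_a, species_b)
print("share of variance per species:", summary["species_share"])
print("share of variance per PC:     ", summary["pc_share"])
print("PC1 weights (A, B):           ", summary["pc1_weights"])
```

When both PC1 weights have the same sign, as they do for positively correlated species, samples with high scores on the first PC have high abundances of both species, which is the pattern read off the biplot in Figure 1d.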