Transformation

3.1 Data types

As mentioned in Chapter 1, the aim of measurement is to generate a numerical description of the real world. This sounds like a merely technical issue; on closer inspection, however, data often mirror the tool that has been used for the measurement. We measure what we can measure and we omit what we cannot. Sometimes we also have a choice in the method we use to obtain some particular information, as for example in measuring the colour of light. We can either use a scale with discrete states (red, blue, yellow, etc.) or measure the wavelength of electromagnetic radiation. In the first case the measurement addresses a type of colour; in the second we get a number, representing a totally different data type. We need to distinguish different data types as their numerical analyses require different treatments. In some cases the transformation of one type into another may be necessary (e.g. in Table 3.3). Some textbooks distinguish between quite a few data types; however, a very simple classification would be the one in Figure 3.1:

Nominal data are recorded according to a list of possible states. Four leaf types are distinguished in Figure 3.1 and labelled by letters. These are (a) Quercus petraea, (b) Q. pubescens, (c) Q. robur and (d) Q. cerris. Data of this type are restricted in the application of mathematical operations. Leaf types are either the same or different, thus the only operations that apply are = and ≠.

Ordinal data are measurements on a rank scale. The three plant species noted in the centre of Figure 3.1 flower at different times of the year. In a warm winter, flowering of Corylus avellana may start in December of the preceding year. However, if cold weather conditions prevail the first flowers may show in late February. Yet the order always remains the same: Corylus will flower before Tussilago, and Prunus will be the latest. Hence, there is a natural order irrespective of weather conditions. The operations applicable to nominal data also apply to ordinal data. In addition, calculating a difference in ranks makes sense. A large difference in ranks usually means lower similarity of the two elements.

Metric data are measurements of distance, volume, weight, force and so on. The example in Figure 3.1 shows the height and diameter of a tree. In metric data all arithmetic operations make sense, including the ones allowed for the previously mentioned types. For example, the height and stem diameter of a tree allow calculation of the approximate volume of the trunk.

One simple rule for the transformation of data types concerns the direction in which this is done. It is easy to transform from metric to ordinal and further to nominal (with loss of information, however), but the opposite direction requires additional assumptions about the meaning of the measurements. Transforming in this opposite direction is nevertheless common practice when analysing plant cover-abundance data, as will be shown in Section 3.4. The transformations presented in the following two sections apply to ordinal and metric data only. In classical statistics (Sampford 1962) there are formal rules that have to be applied when using transformation, for example to correct for non-normal distributions of the data. In fact, transformation generally is used to adapt data to statistical models. Yet I present a slightly different view here: attributes are measured at a specific scale (given by the measuring device used). This scale does not necessarily serve the objective of the investigation. Often, the perspective has to be adjusted: one metre when seen from two metres away may appear large, but when seen from one kilometre's distance it will hardly be visible. Hence, when talking about transformation, we will have to keep the purpose of our measurement in mind.
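Returning to the rule on the direction of type transformation, the asymmetry can be illustrated with a minimal Python sketch. The tree heights, the class limits (10 m and 25 m), the class labels and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical tree heights in metres (metric data).
heights = [4.2, 17.5, 9.8, 31.0]

def to_ranks(values):
    """Metric -> ordinal: replace each value by its rank (1 = smallest)."""
    order = sorted(set(values))
    return [order.index(v) + 1 for v in values]

def to_classes(values, limits=(10.0, 25.0),
               labels=("small", "medium", "large")):
    """Metric -> class labels, using assumed (hypothetical) class limits."""
    classes = []
    for v in values:
        k = sum(v >= limit for limit in limits)  # number of limits reached
        classes.append(labels[k])
    return classes

ranks = to_ranks(heights)      # [1, 3, 2, 4]
classes = to_classes(heights)  # ['small', 'medium', 'small', 'large']
print(ranks, classes)
```

The heights 4.2 and 9.8 end up in the same class, so the step is not reversible: recovering heights from the labels would need an extra assumption, such as assigning each class a representative value.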

3.2 Scalar transformation

When transformations are applied to individual measurements, I call them scalar transformations. Scalar transformation means that the scale used for measuring is adjusted according to our intention. Such transformations are widespread in environmental science. Often a relationship between two variables only emerges after proper transformation. Figure 3.2 illustrates this in a biological example. It is generally assumed that the survival of plant and animal populations depends on appropriate environmental conditions. When the conditions are favourable, populations may grow. Under less favourable conditions, they are likely to remain small. A small population may, for example, consist of five individuals. But 'large' is not, say, 20, but 100 or even more. When correlating population size with an environmental variable, for example temperature, a transformed number of individuals may be a better measure of population size. When taking $n' = n^{0.25}$, for example, we adopt a more qualitative view of the size: 5 will become 1.49 (small), 25 will be 2.23 (average) and 100 is 3.16 (large). Correlating these values with temperature could easily yield a good linear relationship.

| Population size n | 5 | 25 | 100 |
|---|---|---|---|
| nominal | small | medium | large |
| rank | 1 | 2 | 3 |
| $n' = n^{0.25}$ | 1.49 | 2.23 | 3.16 |

Figure 3.2 Scalar transformation of population size to optimize for correlation with environmental factors.
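The fourth-root transformation can be checked with a few lines of Python. This is only a sketch: the function name `scalar_transform` is hypothetical, and the population sizes are the three values shown in Figure 3.2.

```python
def scalar_transform(n, exponent=0.25):
    """Power transformation n' = n ** exponent of a single measurement."""
    return n ** exponent

sizes = [5, 25, 100]  # population sizes from Figure 3.2
transformed = [scalar_transform(n) for n in sizes]

# Ratio of the largest to the smallest population, before and after.
raw_ratio = sizes[-1] / sizes[0]
transformed_ratio = transformed[-1] / transformed[0]

for n, t in zip(sizes, transformed):
    print(f"n = {n:3d}  ->  n' = {t:.3f}")
print(f"ratio large/small: raw {raw_ratio:.1f}, "
      f"transformed {transformed_ratio:.2f}")
```

The printed values agree with those in Figure 3.2 to within rounding of the last digit. The ratio shows the change of perspective: a twentyfold difference in individuals shrinks to roughly a twofold difference in transformed size.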

Another way of reasoning is that scalar transformation changes the perspective of objects: in many ways they appear smaller when seen from a distance, as illustrated in Figure 3.3: the trees are just a series of points in two-dimensional space, connected by a line. On the left, the coordinates are untransformed and all trees have the same height. In the middle and on the right the coordinates have been transformed and this obviously affects the perspective by reducing the importance of high values compared to low values.

Figure 3.3 Scalar transformation of the coordinates of a graph. These transformations affect the perspective adopted during the course of the analysis.

Transformation may sometimes contribute to the solution of problems inherent in ecosystems, such as poor correlation of species occurrence under similar site conditions (Chapter 1). Despite the hope of many practitioners that species will form groups, thereby enabling the identification of vegetation types, reality differs. When inspecting synoptic tables (Section 6.6) many species overlap nicely, but they hardly ever cover the same niche. Even worse, species apparently tend to avoid a common distribution (Clarke 1993). As claimed by Gleason (1926, 1939) in his 'individualistic concept of the plant association', species behave like loners. And in fact, if the formation of an ecological niche is the result of the Darwinian struggle for life, then species are prone to ecological differentiation. I have sketched a typical case of two overlapping species in Figure 3.4. The response of both species to the hypothetical gradient is Gaussian (Section 8.2.2). Despite the shifted optima there is a small area of overlap. On the right, the same situation is shown, but this time with performance scores square-root transformed to let the high scores shrink. Transformation in this case affects the relative overlap, which is now larger than in the left graph. In practice this may be most welcome, as co-occurrence measures of species are often unpleasantly low. As will be shown later (Section 7.2.3), transformations towards presence-absence are frequently a good choice when revealing ecological patterns.

Figure 3.4 Overlap of two species with Gaussian response along a hypothetical gradient (horizontal axis: location along the gradient). Left graph: species scores on a 0-10 performance scale. Right graph: the same scores, but square-root transformed.
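The effect of the square-root transformation on overlap can be reproduced numerically. The following Python sketch rests on several assumptions that are not taken from the text: the optima (15 and 25), the tolerance (4), the gradient positions 1 to 40, and the overlap index itself, defined here as shared performance divided by total performance, sum(min) / sum(max).

```python
import math

def gaussian_response(x, optimum, tolerance, maximum=10.0):
    """Gaussian species response on a 0-10 performance scale."""
    return maximum * math.exp(-((x - optimum) ** 2) / (2.0 * tolerance ** 2))

def relative_overlap(a, b):
    """Assumed overlap index: sum of minima divided by sum of maxima."""
    shared = sum(min(p, q) for p, q in zip(a, b))
    total = sum(max(p, q) for p, q in zip(a, b))
    return shared / total

locations = range(1, 41)  # hypothetical gradient positions
species_a = [gaussian_response(x, optimum=15, tolerance=4) for x in locations]
species_b = [gaussian_response(x, optimum=25, tolerance=4) for x in locations]

overlap_raw = relative_overlap(species_a, species_b)
overlap_sqrt = relative_overlap([math.sqrt(v) for v in species_a],
                                [math.sqrt(v) for v in species_b])

print(f"relative overlap, raw scores:        {overlap_raw:.3f}")
print(f"relative overlap, square-root scores: {overlap_sqrt:.3f}")
```

Taking the square root of a Gaussian curve yields a flatter and wider Gaussian, which is why the relative overlap increases under this index. Other overlap measures would give different numbers.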

3.3 Vector transformation

As shown in Section 2.3.2, data are traditionally organized in two-dimensional data matrices. The column vectors are the sampling units and the row vectors are the attributes. Transformation of vectors therefore concerns rows, columns or both simultaneously. The aim in either case is to obtain similar properties of vectors. When transforming sampling units, the intention is frequently to achieve equal weight of all samples. Attribute transformation gives all attributes the same potential weight in describing the sampling units. Some of the most frequently applied vector transformations are shown in Table 3.1, with a numerical example given in Table 3.2.

Table 3.1 Effects of different vector transformations on the properties of data.

| Term | Formula | Explanation |
|---|---|---|
| Centring | $x'_i = x_i - \bar{x}$ | Adjusts mean to zero |
| Normalizing | $x'_i = x_i / \sqrt{\sum_i x_i^2}$ | Adjusts vector length to 1.0 |
| Standardizing | $x'_i = (x_i - \bar{x}) / s_x$ | Adjusts mean to zero and variance to 1.0 |
| Range adjustment | $x'_i = (x_i - x_{min}) / (x_{max} - x_{min})$ | This is a fuzzy transformation (range 0.0-1.0) |

Table 3.2 Numerical example of the vector transformations in Table 3.1, applied to two vectors.

| Transformation | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $\sum x$ | $\bar{x}$ | $\sum x^2$ | $\sqrt{\sum x^2}$ | $s_x$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Raw | 2.00 | 0.00 | 5.00 | 4.00 | 6.00 | 17.00 | 3.40 | 81.00 | 9.00 | 2.15 |
| | 0.00 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | 1.00 | 9.00 | 3.00 | 0.89 |
| Centred | -1.40 | -3.40 | 1.60 | 0.60 | 2.60 | 0.00 | 0.00 | 23.20 | 4.82 | 2.15 |
| | -1.00 | -1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 4.00 | 2.00 | 0.89 |
| Normalized | 0.22 | 0.00 | 0.56 | 0.44 | 0.67 | 1.89 | 0.38 | 1.00 | 1.00 | 0.24 |
| | 0.00 | 0.00 | 0.33 | 0.67 | 0.67 | 1.67 | 0.33 | 1.00 | 1.00 | 0.30 |
| Standardized | -0.65 | -1.58 | 0.74 | 0.28 | 1.21 | 0.00 | 0.00 | 5.00 | 2.24 | 1.00 |
| | -1.12 | -1.12 | 0.00 | 1.12 | 1.12 | 0.00 | 0.00 | 5.00 | 2.24 | 1.00 |
| Fuzzyfied | 0.33 | 0.00 | 0.83 | 0.67 | 1.00 | 2.83 | 0.57 | 2.25 | 1.50 | 0.36 |
| | 0.00 | 0.00 | 0.50 | 1.00 | 1.00 | 2.50 | 0.50 | 2.25 | 1.50 | 0.45 |

A first step, rarely used alone, is centring. The mean of the vector is subtracted from each element. As a result, the new mean and the new sum both become zero. The sum of squares also changes, without becoming zero. The variance, however, remains unchanged.
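A short Python sketch of centring, applied to the first raw vector of Table 3.2. The function name `centre` is hypothetical, and the variance is computed as the population variance (division by the number of elements), which is the convention that matches the $s_x$ column of Table 3.2.

```python
from statistics import mean, pvariance

def centre(x):
    """Subtract the vector mean from each element."""
    m = mean(x)
    return [v - m for v in x]

raw = [2.0, 0.0, 5.0, 4.0, 6.0]  # first raw vector of Table 3.2
centred = centre(raw)

print("centred vector: ", [round(v, 2) for v in centred])
print("sum:            ", round(sum(centred), 10))
print("sum of squares: ", round(sum(v * v for v in centred), 2))
print("variance before:", round(pvariance(raw), 2))
print("variance after: ", round(pvariance(centred), 2))
```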

Normalizing is a different method of transformation. Each element of the vector is divided by the vector's (Euclidean) length. The vector sum and the vector mean change, and the vector length is now 1.0. As shown in Table 3.2, the vectors become more similar in many ways while the variances still differ.
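Normalizing can be sketched in the same way. The helper names `vector_length` and `normalize` are hypothetical. A vector consisting only of zeros has length zero and cannot be normalized, which this sketch reports as an error.

```python
import math

def vector_length(x):
    """Euclidean length: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))

def normalize(x):
    """Divide each element by the Euclidean length of the vector."""
    length = vector_length(x)
    if length == 0.0:
        raise ValueError("a zero vector cannot be normalized")
    return [v / length for v in x]

raw = [2.0, 0.0, 5.0, 4.0, 6.0]  # first raw vector of Table 3.2
normalized = normalize(raw)

print("normalized vector:", [round(v, 2) for v in normalized])
print("new length:       ", round(vector_length(normalized), 10))
```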

The most rigorous transformation is standardizing. It combines centring with a rescaling similar to normalizing: each centred element is divided by the standard deviation of the vector. As a result, the vector mean is zero and the standard deviation (and the variance) becomes 1.0. The length of the vector is equal to the square root of the number of elements. Standardization is used to compare differently scaled measurements, such as temperature and the height of trees. However, standardization has a downside: if the information is hidden in the variance then it will be lost.
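A sketch of standardizing, again for the first raw vector of Table 3.2. It assumes the population standard deviation (division by the number of elements), which reproduces the $s_x$ values of Table 3.2; with the sample standard deviation the vector length would not equal the square root of the number of elements. The function name `standardize` is hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(x):
    """Centre the vector, then divide by its population standard deviation."""
    m, s = mean(x), pstdev(x)
    if s == 0.0:
        raise ValueError("a constant vector cannot be standardized")
    return [(v - m) / s for v in x]

raw = [2.0, 0.0, 5.0, 4.0, 6.0]  # first raw vector of Table 3.2
standardized = standardize(raw)
length = math.sqrt(sum(v * v for v in standardized))

print("standardized vector:", [round(v, 2) for v in standardized])
print("mean:               ", round(mean(standardized), 10))
print("standard deviation: ", round(pstdev(standardized), 10))
print("length:             ", round(length, 2), "= sqrt(5)")
```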

Fuzzyfying is a simple transformation (Boyce & Ellison 2001). The elements are adjusted to range from zero (lowest score) to 1.0 (highest score). It should be used only if you intend to adopt this view of the data. Aberrant values can set the boundaries in an undesirable way and distort the remaining observations completely. Fuzzy transformation is not an alternative to normalizing or standardizing, but rather is applied in combination with these.
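The sensitivity of range adjustment to aberrant values is easy to demonstrate. In this sketch the function name `fuzzify` and the aberrant value of 60 are hypothetical; the remaining values are the first raw vector of Table 3.2.

```python
def fuzzify(x):
    """Range adjustment: lowest score becomes 0.0, highest becomes 1.0."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("a constant vector has no range to adjust")
    return [(v - lo) / (hi - lo) for v in x]

raw = [2.0, 0.0, 5.0, 4.0, 6.0]  # first raw vector of Table 3.2
with_outlier = raw + [60.0]      # one hypothetical aberrant value added

print("without outlier:", [round(v, 2) for v in fuzzify(raw)])
print("with outlier:   ", [round(v, 2) for v in fuzzify(with_outlier)])
```

With the aberrant value included, the five original scores are squeezed into the lowest tenth of the range, so most of the contrast between them is lost.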

3.4 Example: Transformation of plant cover data

In phytosociology, Braun-Blanquet (1932) established a scale for measuring the quantity of plant species - that is, species performance - in vegetation relevés. He published his first comprehensive book on that topic in 1928 (English version in 1932). From the point of view of modern data analysis this scale (the so-called cover-abundance scale) is a mixture of form and content. At lower species densities, it expresses the abundance of individuals. At high densities, it directly translates to plant cover percentage. As shown in Table 3.3, it starts with a nominal notation in the form of the symbol 'empty' (in Table 3.3 a minus sign), followed by '+'. Then it continues with a rank scale from 1 to 5. In the past hundred years, huge data sets have been collected all over the globe using this scale (Dengler et al. 2008). Handling such data is therefore an issue in data analysis. Table 3.3 demonstrates how it could be done based on an idea published by van der Maarel (1979).

Table 3.3 Transformation of cover-abundance values in phytosociology.

| Code | Cover (%) | Rank x | x' (y = 0.1) | x' (y = 0.25) | x' (y = 2.5) |
|---|---|---|---|---|---|
| - | 0 | 0 | 0 | 0 | 0 |
| + | <1 | 1 | 1 | 1 | 1 |
| 1 | 5 | 2 | 1.07 | 1.19 | 5.65 |
| 2 | 17.5 | 3 | 1.12 | 1.31 | 15.58 |
| 3 | 37.5 | 4 | 1.15 | 1.41 | 32.00 |
| 4 | 62.5 | 5 | 1.17 | 1.50 | 55.90 |
| 5 | 87.5 | 6 | 1.19 | 1.57 | 88.18 |

In the first step the code is transformed into a proper rank scale with a range from 0 to 6 (column three in Table 3.3). The ranks are then treated as if they were metric. The justification for this is shown in the right-hand columns, where the rank scale is further transformed according to:

$$x' = x^{y}$$

where x is the rank and x' is the transformed score. When y < 1 the data approach a binary state {0, 1}, the more so the smaller y is. Near y = 2.5 it can be seen that this approximates the initial cover percentages. By choosing the appropriate value for y the scope of the analysis can hence be altered to emphasize either the qualitative or the quantitative aspect. For many applications, choosing y = 0.25 turns out to be a good compromise as this expresses the qualitative view while considering the quantitative sufficiently as well (see Section 7.2.3).
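The two steps, code to rank and rank to transformed score, can be sketched in Python. The sketch assumes that the transformation has the form x' = x ** y applied to the rank x, which is consistent with the values in Table 3.3. The names `RANKS` and `transform_cover` and the example relevé are hypothetical.

```python
# Braun-Blanquet codes and their ranks 0-6 (column three of Table 3.3).
RANKS = {"-": 0, "+": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6}

def transform_cover(code, y):
    """Return x' = x ** y for the rank x of a cover-abundance code (y > 0)."""
    if y <= 0:
        raise ValueError("y must be positive")
    return float(RANKS[code]) ** y

releve = ["-", "+", "1", "3", "5"]  # hypothetical records of five species

for y in (0.1, 0.25, 2.5):
    scores = [round(transform_cover(code, y), 2) for code in releve]
    print(f"y = {y}: {scores}")
```

With a small exponent all present species receive scores close to 1, so the result is nearly presence-absence; with y = 2.5 the scores spread out to the order of magnitude of the cover percentages. The restriction to positive y is a choice made in this sketch so that an absent species always keeps the score 0.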
