Linkage clustering

The process of agglomerative hierarchical clustering is demonstrated for three methods: single-, complete- and average-linkage clustering. Whereas single- and complete-linkage clustering are unambiguous terms, average-linkage clustering is used for several different methods. The one presented below is also called centroid clustering; others are listed in Section 6.4.

All three methods use the same definition for the comparison of any two sampling units. The resemblances (distance or similarity) are always taken from the resemblance matrix of the sample, but methods differ in the way they consider larger groups (Figure 6.3, left side). Single-linkage always uses the distance (or similarity) of the closest members of any two groups.

Figure 6.3 Comparing single- (SL), average- (AL) and complete-linkage (CL) clustering. Left: group definitions applied in a one-dimensional example. Right: results displayed as dendrograms.

Complete-linkage refers to the two most distant members of any two groups, so the distance between groups is much larger. Average-linkage measures the distance between the centroids of the groups; in the one-dimensional example of Figure 6.3 this is the centre of the two (or more) sampling units involved. These definitions imply that the distance between groups is generally greater in complete-linkage than in single-linkage, and intermediate in average-linkage.
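The three group definitions can be written down compactly for one-dimensional data. The following is a minimal sketch, not taken from the text: the function names and the sample coordinates are illustrative assumptions, and "average-linkage" is implemented in the centroid sense used in this section.

```python
# Group-to-group distance under the three linkage rules, for 1-D points.
# Function names and sample coordinates are hypothetical, for illustration only.

def single_linkage(a, b):
    """Distance between the closest members of groups a and b."""
    return min(abs(x - y) for x in a for y in b)


def complete_linkage(a, b):
    """Distance between the most distant members of groups a and b."""
    return max(abs(x - y) for x in a for y in b)


def centroid_linkage(a, b):
    """Distance between the centroids (means) of groups a and b."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


group_a = [0.0, 2.0]  # assumed coordinates of one group
group_b = [5.0, 9.0]  # assumed coordinates of another group

sl = single_linkage(group_a, group_b)
al = centroid_linkage(group_a, group_b)
cl = complete_linkage(group_a, group_b)
print(sl, al, cl)  # 3.0 6.0 9.0
```

For these two groups the single-linkage distance is the smallest, the complete-linkage distance the largest, and the centroid distance lies in between.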

The example in Figure 6.3 is one-dimensional. The dissimilarity of any two points is therefore just their distance along a straight line. When the sampling units are arranged from left to right, the distance matrix is:

     1  2  3  4
1    0  2  5  9
2    2  0  3  7
3    5  3  0  4
4    9  7  4  0
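In one dimension such a matrix follows directly from the positions of the points on the line. A minimal sketch, assuming coordinates 0, 2, 5 and 9 (hypothetical values, chosen so that the consecutive gaps of 2, 3 and 4 units agree with the distances quoted in the walk-through that follows):

```python
# Assumed coordinates: any four collinear points with consecutive gaps
# of 2, 3 and 4 units would produce the same distance matrix.
positions = [0, 2, 5, 9]

# Pairwise absolute differences form the symmetric distance matrix.
dist = [[abs(p - q) for q in positions] for p in positions]

for row in dist:
    print(" ".join(f"{d:2d}" for d in row))
```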

In the first step (numbered arrows in Figure 6.3) all methods do the same: they find the first two points to be the most similar, separated by two units. This yields a first arch in the dendrogram, two units in height.

In the second step the next closest neighbours are sought. In single-linkage this is the third point, at a distance of three units from the group formed before. For complete-linkage the third point is five units from the new group, so the next fusion instead joins points three and four, which are only four units apart; the corresponding arch has height four. The same fusion is made in average-linkage. Here, however, there are two equally valid solutions, because the distance between the centre of the first two points and the third point is also four units.

In the third and final step, single-linkage adds the fourth point to the existing cluster at arch height four, the distance between the third and fourth points. In complete-linkage, the two two-member groups are fused at arch height nine, the distance between the most distant points. The same fusion is made in average-linkage, but the height of the arch reflects the distance between the centres of the two groups.
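The three steps can be replayed with a small agglomerative loop. This is a sketch under stated assumptions rather than the procedure used to draw Figure 6.3: the coordinates 0, 2, 5, 9 are inferred from the distances quoted above, the names `group_distance` and `agglomerate` are hypothetical, and the `tie` parameter is an illustrative device for choosing between equally close pairs (the default picks the last tied pair, which reproduces the fusion of points three and four described for average-linkage).

```python
from itertools import combinations


def group_distance(a, b, method):
    """Distance between two groups of 1-D points under a linkage rule."""
    if method == "single":
        return min(abs(x - y) for x in a for y in b)
    if method == "complete":
        return max(abs(x - y) for x in a for y in b)
    if method == "centroid":
        return abs(sum(a) / len(a) - sum(b) / len(b))
    raise ValueError(f"unknown method: {method}")


def agglomerate(points, method, tie=-1):
    """Return the fusions as (merged_group, height), one per step.

    Points are assumed to be distinct. `tie` selects among equally close
    pairs: -1 takes the last tied pair, 0 the first.
    """
    groups = sorted((p,) for p in points)
    fusions = []
    while len(groups) > 1:
        pairs = list(combinations(groups, 2))
        best = min(group_distance(a, b, method) for a, b in pairs)
        tied = [(a, b) for a, b in pairs
                if group_distance(a, b, method) == best]
        a, b = tied[tie]
        merged = tuple(sorted(a + b))
        # Replace the two fused groups by their union.
        groups = sorted([g for g in groups if g not in (a, b)] + [merged])
        fusions.append((merged, best))
    return fusions


for method in ("single", "centroid", "complete"):
    heights = [h for _, h in agglomerate([0, 2, 5, 9], method)]
    print(method, heights)
# single [2, 3, 4]
# centroid [2.0, 4.0, 6.0]
# complete [2, 4, 9]
```

Calling `agglomerate([0, 2, 5, 9], "centroid", tie=0)` yields the other tied solution, in which the third point joins the first two at height four before the fourth point is added.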

Although this is a tiny example, the shape of the resulting dendrograms is typical. In single-linkage, the chaining effect can be seen. The dendrogram formed by complete-linkage is more balanced and typically much higher. The average-linkage dendrogram has intermediate shape.
