Unweighted centroid clustering UPGMC

Centroid The centroid of a cluster of objects may be imagined as the type-object of the cluster, whether that object actually exists or is only a mathematical construct. In A-space (Fig. 7.2), the coordinates of the centroid of a cluster are computed by averaging the coordinates of the objects in the group.

Unweighted centroid clustering (Lance & Williams, 1967c; "UPGMC" in Sneath & Sokal, 1973: "Unweighted Pair-Group Centroid Method") is based on a simple geometric approach. Along a decreasing scale of similarities, UPGMC proceeds to the fusion of objects or clusters presenting the highest similarity, as in the previous methods. At each step, the members of a cluster are replaced by their common centroid (i.e. "mean point"). The centroid is considered to represent a new object for the remainder of the clustering procedure; in the next step, one looks again for the pair of objects with the highest similarity, on which the procedure of fusion is repeated.

Gower (1967) proposed the following formula for centroid clustering, where the similarity of the centroid (hi) of the objects or clusters h and i with a third object or cluster g is computed from the similarities S(h, g), S(i, g), and S(h, i):

wh w, whwi S (hi, g) =w-+wT S (h, g) +—+- S (i, g) +-h--—2 [ 1-S (h, i)] (8.3)

were the w's are weights given to the clusters. To simplify the symbols, letters g, h, and i are used here to represent three objects considered in the course of clustering; g, h, and i may also represent centroids of clusters obtained during previous clustering steps. Gower's formula insures that the centroid hi of objects (or clusters) h and i is geometrically located on the line between h and i. In classical centroid clustering, the numbers of objects nh and ni in clusters h and i are taken as values for the weights wh and wi; these weights are 1 at the start of the clustering because there is then a single

Table 8.4 Weighted arithmetic average clustering (WPGMA) of the pond data. At each step, the highest similarity value is identified (italicized boldface value) and the two corresponding objects or groups are fused by averaging their similarities (boxes).


Was this article helpful?

0 0

Post a comment