Cluster Analysis 9
partitions of the data in Stata), and
provide an initial grouping variable that defines the groups among the objects to be clustered. The group means (or medians) of these groups are used as the starting centers (Group means from partitions defined by initial grouping variables in Stata).

After the initialization, k-means successively reassigns the objects to other clusters with the aim of minimizing the within-cluster variation. This within-cluster variation is the sum of the squared distances from each observation to the center of its associated cluster (i.e., the centroid). If reallocating an object to another cluster decreases the within-cluster variation, the object is reassigned to that cluster. Since cluster affiliations can change in the course of the clustering process (i.e., an object can move to another cluster in the course of the analysis), k-means does not build a hierarchy as hierarchical clustering does (Fig. 9.4). Therefore, k-means belongs to the group of non-hierarchical clustering methods. For a better understanding of the approach, let's take a look at how it works in practice. Figures 9.10, 9.11, 9.12 and 9.13 illustrate the four steps of the k-means clustering process. Research has produced several variants of the original algorithm, which we briefly discuss in Box 9.2.
The k-means procedure is now repeated until a predetermined number of iterations is reached, or convergence is achieved (i.e., there is no change in the cluster affiliations). Three aspects are worth noting in terms of using k-means:
Fig. 9.10 k-means procedure (step 1: placing random cluster centers)
Fig. 9.11 k-means procedure (step 2: assigning objects to the closest cluster center)
Fig. 9.12 k-means procedure (step 3: recomputing cluster centers)
Fig. 9.13 k-means procedure (step 4: reassigning objects to the closest cluster center)
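The four steps shown in Figs. 9.10 to 9.13 can be sketched in a few lines of plain Python. This is a minimal illustration, not Stata's implementation: the function names, the toy data, and the rule that an empty cluster keeps its old center are all assumptions made for the example. The starting centers are passed in explicitly so that the run is reproducible; in step 1 they would normally be drawn at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Steps 2 and 4: label each object with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def recompute_centers(points, labels, centers):
    """Step 3: move each center to the mean of the objects assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def within_cluster_variation(points, labels, centers):
    """Sum of squared distances from each object to its cluster center."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


def kmeans(points, start_centers, max_iter=100):
    """Repeat steps 3 and 4 until no object changes cluster or max_iter is hit."""
    centers = list(start_centers)          # step 1: starting centers
    labels = assign(points, centers)       # step 2: first assignment
    for _ in range(max_iter):
        centers = recompute_centers(points, labels, centers)  # step 3
        new_labels = assign(points, centers)                  # step 4
        if new_labels == labels:           # convergence: no affiliation changed
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy data: two visually obvious groups of three objects each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = kmeans(points, start_centers=[(1, 1), (2, 1)])
variation = within_cluster_variation(points, labels, centers)
```

With these starting centers, the object at (2, 1) begins in the second cluster and moves to the first one after the centers are recomputed, which is the kind of reallocation that a hierarchical method would not allow.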
compared to similar solutions, but not globally. Therefore, you should run k-means multiple times using different options for generating a starting partition.
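The effect of the starting partition can be shown on a small example. The sketch below runs a compact k-means from three different sets of starting centers and keeps the solution with the lowest within-cluster variation. The strategy names (first_k, last_k, spaced) and the four-point data set are hypothetical and chosen only so that two of the three starts get stuck in a poorer solution.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, start_centers, max_iter=100):
    """Compact k-means; returns labels, centers and within-cluster variation."""
    centers = [tuple(c) for c in start_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:   # no object changed cluster: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Corners of a wide rectangle: the natural split is left pair vs. right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
k = 2

# Three hypothetical ways of choosing k starting centers from the data.
starts = {
    "first_k": points[:k],
    "last_k": points[-k:],
    "spaced": points[:: len(points) // k][:k],
}

results = {name: kmeans(points, start) for name, start in starts.items()}
best_name = min(results, key=lambda name: results[name][2])
best_labels, best_centers, best_wcss = results[best_name]
```

Here the first_k and last_k starts both converge to a split into a bottom and a top pair, whereas the spaced start finds the left and right pair with a much lower within-cluster variation. Each run has converged, yet only one of them is the better solution.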
Box 9.2 Variants of the Original k-means Method
k-medians is a popular variant of k-means and has also been implemented in Stata. This procedure essentially follows the same logic and procedure as k-means. However, instead of using the cluster mean as a reference point for the calculation of the within-cluster variation, k-medians minimizes the absolute deviations from the cluster medians, which equals the city-block distance. Thus, k-medians does not optimize the squared deviations from the mean as in k-means, but absolute distances. In this way, k-medians avoids the possible effect of extreme values on the cluster solution. Further variants, which are not menu-accessible in Stata, use other cluster centers (e.g., k-medoids; Kaufman and Rousseeuw 2005; Park and Jun 2009), or optimize the initialization process (e.g., k-means++; Arthur and Vassilvitskii 2007).
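The difference between the two reference points can be illustrated for a single cluster that contains one extreme value. The sketch below is not a full k-medians implementation; it only compares the mean center with the coordinate-wise median center and sums the city-block distances to each. The data and helper names are made up for the example.

```python
from statistics import mean, median


def cityblock(a, b):
    """City-block (absolute) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def center(points, stat):
    """Apply a summary statistic (mean or median) to each coordinate."""
    return tuple(stat(coord) for coord in zip(*points))


# Hypothetical cluster: four nearby objects and one extreme value.
cluster = [(1, 1), (2, 2), (3, 1), (2, 3), (50, 40)]

mean_center = center(cluster, mean)      # pulled toward the extreme value
median_center = center(cluster, median)  # stays with the bulk of the objects

# Total absolute deviation of the cluster from each candidate center.
abs_dev_mean = sum(cityblock(p, mean_center) for p in cluster)
abs_dev_median = sum(cityblock(p, median_center) for p in cluster)
```

In this example the median center stays at (2, 2), among the four nearby objects, while the mean center is dragged toward the extreme value, and the total absolute deviation from the median center is the smaller of the two.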