Cluster Analysis 9


Select a Measure of Similarity or Dissimilarity




In the previous section, we discussed different linkage algorithms used in agglomerative hierarchical clustering as well as the k-means procedure. All these clustering procedures rely on measures that express the (dis)similarity between pairs of objects. In the following section, we introduce different measures for metric, ordinal, nominal, and binary variables.




        1. Metric and Ordinal Variables


Distance Measures


A straightforward way to assess two objects’ proximity is by drawing a straight line between them. For example, on examining the scatter plot in Fig. 9.1, we can easily see that the length of the line connecting observations B and C is much shorter than the line connecting B and G. This type of distance is called Euclidean distance or straight line distance; it is the most commonly used type for analyzing metric variables and, if the scales are equidistant (Chap. 3), ordinal variables. Statistical software programs such as Stata simply refer to the Euclidean distance as L2, as it is a specific type of the more general Minkowski distance metric with argument 2 (Anderberg 1973). Researchers also often use the squared Euclidean distance, referred to as L2 squared in Stata. For k-means, using the squared Euclidean distance is more appropriate because of the way the method computes the distances from the objects to the centroids (see Section 9.3.2.2).
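To make the relationship between L2, L2 squared, and the Minkowski metric concrete, here is a minimal Python sketch. The function names `minkowski` and `squared_euclidean` and the two sample points are illustrative assumptions, not part of the text or of Stata.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points with equally many variables."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of variables")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def squared_euclidean(a, b):
    """Sum of squared differences, i.e. the Euclidean distance without the root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical points, chosen only because the arithmetic is easy to follow.
p1 = (0, 0)
p2 = (3, 4)

l2 = minkowski(p1, p2, 2)                # argument 2 gives the Euclidean distance
l2_squared = squared_euclidean(p1, p2)   # the same quantity before taking the root

print(l2, l2_squared)
```

Setting the argument to 2 reproduces the Euclidean distance; other values of `p` give other members of the Minkowski family.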


In order to use a hierarchical clustering procedure, we need to express these distances mathematically. Using the data from Table 9.1, we can compute the Euclidean distance between customer B and customer C (generally referred to as d(B,C)) by using variables x and y with the following formula:

\[
d_{Euclidean}(B, C) = \sqrt{(x_B - x_C)^2 + (y_B - y_C)^2}
\]

As can be seen, the Euclidean distance is the square root of the sum of the squared differences in the variables’ values. Using the data from Table 9.1, we obtain the following:



\[
d_{Euclidean}(B, C) = \sqrt{(82 - 66)^2 + (94 - 80)^2} = \sqrt{452} \approx 21.260
\]

This distance corresponds to the length of the line that connects objects B and C. In this case, we only used two variables, but we can easily add more under the root sign in the formula. However, each additional variable will add a dimension to our research problem (e.g., with six clustering variables, we have to deal with six dimensions), making it impossible to represent the solution graphically. Similarly, we can compute the distance between customer B and G, which yields the following:



\[
d_{Euclidean}(B, G) = \sqrt{(82 - 10)^2 + (94 - 17)^2} = \sqrt{11{,}113} \approx 105.418
\]
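The two Euclidean distances above can be checked with a short Python sketch. The coordinates for B, C, and G are the values used in the formulas; the function name `euclidean` is an illustrative assumption. Because the function sums over all variables, it also works unchanged when more clustering variables are added.

```python
import math

# (x, y) values for the three customers that appear in the worked formulas.
B = (82, 94)
C = (66, 80)
G = (10, 17)


def euclidean(a, b):
    """Square root of the sum of squared differences across all variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_bc = euclidean(B, C)   # sqrt(16^2 + 14^2) = sqrt(452)
d_bg = euclidean(B, G)   # sqrt(72^2 + 77^2) = sqrt(11113)

print(round(d_bc, 3), round(d_bg, 3))
```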

Likewise, we can compute the distance between all other pairs of objects and summarize them in a distance matrix. Table 9.2, which we used as input to illustrate the single linkage algorithm, shows the Euclidean distance matrix for objects A-G.
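As a sketch of how such a distance matrix can be assembled, the following Python snippet computes all pairwise Euclidean distances for the three customers whose coordinates appear in the formulas above (B, C, and G). The values for the remaining customers in Table 9.1 are not reproduced here, so the result is only a partial matrix; the dictionary layout is an illustrative choice.

```python
import math
from itertools import combinations

# Only the customers whose (x, y) values appear in the worked examples.
customers = {"B": (82, 94), "C": (66, 80), "G": (10, 17)}

labels = sorted(customers)
# Start with zeros everywhere; the diagonal (distance to itself) stays 0.
matrix = {row: {col: 0.0 for col in labels} for row in labels}

for a, b in combinations(labels, 2):
    d = math.dist(customers[a], customers[b])  # Euclidean distance (Python 3.8+)
    matrix[a][b] = d
    matrix[b][a] = d  # the matrix is symmetric: d(a, b) equals d(b, a)

for row in labels:
    print(row, [round(matrix[row][col], 3) for col in labels])
```

Only one triangle of the matrix carries distinct information, which is why distance matrices are often shown in lower-triangular form.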


There are also alternative distance measures: The city-block distance (called L1 in Stata) uses the sum of the variables’ absolute differences. This distance measure is referred to as the Manhattan metric as it is akin to the walking distance between two points in a city like New York’s Manhattan district, where the distance equals the number of blocks in the directions North-South and East-West. Using the city-block distance to compute the distance between customers B and C (or C and B) yields the following:
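A minimal Python sketch of the city-block computation, reusing the coordinates of B and C from the Euclidean example (B = (82, 94), C = (66, 80)). The function name `city_block` is an illustrative assumption.

```python
def city_block(a, b):
    """Sum of the absolute differences across all variables (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


B = (82, 94)
C = (66, 80)

d_bc = city_block(B, C)   # |82 - 66| + |94 - 80| = 16 + 14
d_cb = city_block(C, B)   # the order of the two objects does not matter

print(d_bc, d_cb)
```

Both calls give 30, which is larger than the Euclidean distance of about 21.260 for the same pair, since the city-block path runs along the axes rather than along the straight line.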
