Cluster Analysis 9


Validate and Interpret the Clustering Solution





In the next step, we create a cluster membership variable, which indicates the cluster to which each observation belongs. To do so, go to ► Statistics ► Multivariate analysis ► Cluster analysis ► Postclustering ► Generate summary variables from cluster. In the dialog box that opens (Fig. 9.23), enter a name, such as cluster_wl, for the variable to be created in the Generate variable(s) box. In the drop-down list From cluster analysis, we can choose on which previously run cluster analysis the cluster membership variable should be based. As this is our first analysis, we can only select wards_linkage. Finally, specify the number of clusters to extract (3) under Number of groups to form and proceed by clicking OK.
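The same step can also be run from the command line. The following is a minimal sketch, assuming the Ward's linkage analysis was stored under the name wards_linkage, as shown in the dialog box:

```stata
* Create a membership variable for the three-cluster solution of the
* cluster analysis stored under the name wards_linkage
cluster generate cluster_wl = groups(3), name(wards_linkage)
```

Naming the analysis explicitly with name() keeps the command unambiguous once more than one cluster analysis has been run on the dataset.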


Stata generates a new variable cluster_wl, which indicates the group to which each observation belongs. We can now use this variable to describe and profile the clusters. In a first step, we tabulate the number of observations in each cluster by going to ► Statistics ► Summaries, tables, and tests ► Frequency tables ► One-way table. Simply select cluster_wl in the drop-down menu under Categorical variable, tick the box next to Treat missing values like other values, and click on OK. The output in Table 9.17 shows that the cluster analysis assigned 969 observations to the three segments; 96 observations are not assigned to any segment due to missing values. The first cluster comprises 445 observations, the second cluster 309 observations, and the third cluster 215 observations.

Table 9.17 Cluster sizes

tabulate cluster_wl, missing

 cluster_wl |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        445       41.78       41.78
          2 |        309       29.01       70.80
          3 |        215       20.19       90.99
          . |         96        9.01      100.00
------------+-----------------------------------
      Total |      1,065      100.00

Next, we would like to compute the centroids of our clustering variables. To do so, go to ► Statistics ► Summaries, tables, and tests ► Other tables ► Compact table of summary statistics and enter e1 e5 e9 e21 e22 into the Variables box. Next, click on Group statistics by variable and select cluster_wl from the list. Under Statistics to display, tick the first box and select Mean, followed by OK. Table 9.18 shows the resulting output.

Table 9.18 Comparison of means

tabstat e1 e5 e9 e21 e22, statistics( mean ) by(cluster_wl)

Summary statistics: mean
  by categories of: cluster_wl

cluster_wl |        e1        e5        e9       e21       e22
-----------+--------------------------------------------------
         1 |  92.39326  75.50562  89.74607   81.7191  62.33933
         2 |   97.1068  95.54693  97.50809  96.63754  92.84466
         3 |   59.4186  58.28372  71.62791  56.72558  58.03256
-----------+--------------------------------------------------
     Total |  86.57998  78.07534  88.20124  80.93086  71.11146
Comparing the variable means across the three clusters, we find that respondents in the first cluster strongly emphasize punctuality (e1), while comfort (e5) and, particularly, entertainment aspects (e22) are less important. Respondents in the second cluster have extremely high expectations regarding all five performance features, as evidenced in average values well above 90. Finally, respondents in the third cluster do not express high expectations in general, except in terms of security (e9). Based on these results, we could label the first cluster “on-time is enough,” the second cluster “the demanding traveler,” and the third cluster “no thrills.” We could further check whether these differences in means are significant by using a one-way ANOVA as described in Chap. 6.

In a further step, we can try to profile the clusters using sociodemographic variables. Specifically, we use crosstabs (see Chap. 5) to contrast our clustering with the variable flight_purpose, which indicates whether the respondents primarily fly for business purposes (flight_purpose = 1) or private purposes (flight_purpose = 2). To do so, click on ► Statistics ► Summaries, tables, and tests ► Frequency tables ► Two-way table with measures of association. In the dialog box that opens, enter cluster_wl into the Row variable box and flight_purpose into the Column variable box. Select Pearson’s chi-squared and Cramér’s V under Test statistics and click on OK. The results in Table 9.19 show that the majority of respondents in the first and third cluster are business travelers, whereas the second cluster primarily comprises private travelers. The χ2-test statistic (Pr = 0.004) indicates a significant relationship between these two variables. However, the association between the variables is rather weak, as indicated by the Cramér’s V of 0.1065.

Table 9.19 Crosstab

tabulate cluster_wl flight_purpose, chi2 V

           |  Do you normaly fly
           |   for business or
           |  leisure purposes?
cluster_wl |  Business    Leisure |     Total
-----------+----------------------+----------
         1 |       239        206 |       445
         2 |       130        179 |       309
         3 |       114        101 |       215
-----------+----------------------+----------
     Total |       483        486 |       969

          Pearson chi2(2) =  10.9943   Pr = 0.004
               Cramér's V =   0.1065
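For readers who prefer a do-file, the two checks described above (the one-way ANOVA on the cluster means and the crosstab against flight_purpose) can be sketched as follows. The loop over the five clustering variables, the row option for within-cluster percentages, and the final display line are additions for illustration and not part of the original walkthrough.

```stata
* One-way ANOVA of each clustering variable across the three clusters
foreach v of varlist e1 e5 e9 e21 e22 {
    oneway `v' cluster_wl, tabulate
}

* Crosstab of cluster membership against flight purpose;
* row adds within-cluster percentages to the counts in Table 9.19
tabulate cluster_wl flight_purpose, chi2 V row

* Cramér's V recomputed from the results stored by tabulate:
* sqrt(chi2 / (N * min(rows - 1, columns - 1)))
display sqrt(r(chi2) / (r(N) * min(r(r) - 1, r(c) - 1)))
```

With the values in Table 9.19, the last line evaluates sqrt(10.9943 / (969 × 1)), which is roughly 0.1065 and matches the reported Cramér's V.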
The Oddjob Airways dataset offers various other variables such as age, gender, or status, which could be used to further profile the cluster solution. However, instead of testing these variables’ efficacy step-by-step, we proceed and assess the solution’s stability by running an alternative clustering procedure on the data. Specifically, we apply the k-means method, using the grouping from the Ward’s linkage analysis as input for the starting partition. To do so, go to:

  • Statistics ► Multivariate analysis ► Cluster analysis ► Cluster data ► Kmeans. In the dialog box that opens, enter e1, e5, e9, e21, and e22 into the Variables box, choose 3 clusters, and select L2squared or squared Euclidean under (Dis)similarity measure (Fig. 9.24). Under Name this cluster analysis, make sure that you specify an intuitive name, such as kmeans. On the Options tab, we can choose how k-means should derive a starting partition for the analysis. Since we want to use the clustering from our previous analysis with Ward’s linkage, we need to choose the last option and select the cluster_wl variable in the corresponding drop-down menu (Fig. 9.25). Now click on OK.

Fig. 9.24 k-means dialog box

Fig. 9.25 Options in the k-means dialog box

Stata not only issues a command (cluster kmeans e1 e5 e9 e21 e22, k(3) measure(L2squared) name(kmeans) start(group(cluster_wl))) but also adds a new variable kmeans to the dataset, which indicates each observation’s cluster affiliation, analogous to the cluster_wl variable for Ward’s linkage. To explore the overlap in the two cluster solutions, we can contrast the results using crosstabs. To do so, go to ► Statistics ► Summaries, tables, and tests ► Frequency tables ► Two-way table with measures of association and select cluster_wl under Row variable and kmeans under Column variable. After clicking on OK, Stata will produce an output similar to Table 9.20.

Table 9.20 Comparison of clustering results

tabulate cluster_wl kmeans

           |              kmeans
cluster_wl |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       320        107         18 |       445
         2 |         2        307          0 |       309
         3 |        36         10        169 |       215
-----------+---------------------------------+----------
     Total |       358        424        187 |       969

The results show that there is a strong degree of overlap between the two cluster analyses. For example, 307 observations that fall into the second cluster in the Ward’s linkage analysis also fall into this cluster in the k-means clustering. Only two observations from this cluster appear in the first k-means cluster. The divergence in the clustering solutions is somewhat higher in the third and, especially, in the first cluster, but still low in absolute terms. Overall, the two analyses have an overlap of (320 + 307 + 169)/969 = 82.15%, which is very satisfactory as less than 20% of all observations appear in a different cluster when using k-means.
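The overlap figure can also be computed directly from the two membership variables. This is a minimal sketch; the scalar name n_match is hypothetical. It assumes that the cluster labels of the two solutions correspond, which is plausible here only because k-means was started from the Ward's linkage grouping.

```stata
* Count observations that carry the same cluster label in both solutions;
* missing() excludes observations without a cluster assignment
count if cluster_wl == kmeans & !missing(cluster_wl, kmeans)
scalar n_match = r(N)

* Count all observations that were clustered in both analyses
count if !missing(cluster_wl, kmeans)

* Share of observations on the diagonal of Table 9.20, in percent
display "Overlap: " %5.2f (100 * n_match / r(N)) "%"
```

With the counts in Table 9.20, this corresponds to 796 matching observations out of 969.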
This analysis concludes our cluster analysis. However, we could further explore the solution’s stability by running other linkage algorithms, such as centroid or complete linkage, on the data. Similarly, we could use different (dis)similarity measures and assess their impact on the results. So go ahead and explore these options yourself!
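As a starting point for this exploration, the following sketch runs complete linkage on the same five variables and compares the result with the Ward's linkage solution. The names complete_linkage and cluster_cl are hypothetical, and squared Euclidean distance is assumed as the (dis)similarity measure.

```stata
* Complete linkage on the same clustering variables (names are illustrative)
cluster completelinkage e1 e5 e9 e21 e22, measure(L2squared) name(complete_linkage)

* Extract a three-cluster solution from the new analysis
cluster generate cluster_cl = groups(3), name(complete_linkage)

* Contrast the new grouping with the Ward's linkage solution
tabulate cluster_wl cluster_cl
```

Because the two analyses are run independently, their cluster numbers need not line up, so the largest cell in each row, rather than the diagonal, indicates which clusters correspond.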





