Cluster analysis is a data reduction technique that aims to reveal subgroups of observations within a data set. An important use of cluster analysis is to segment a target population using multiple indicators. This kind of segmentation is the foundation of refined management and personalized operations. Only when the groups are classified correctly can each one be served with differentiated operations, services, and product support.
Sometimes we need not only to aggregate and analyze data, but also to divide it into more meaningful categories based on given indicators so that the analysis can be refined. In such cases, you can try the hierarchical cluster analysis method introduced in this article.
The following table lists eight main variables of per capita average annual consumption expenditure for urban households in 31 provinces, municipalities directly under the central government, and autonomous regions of China in 1999.
Ward's method is sensitive to outliers, tends to merge clusters that contain few observations, and tends to produce clusters with roughly the same number of observations. Of the four results above, the Ward result best meets our actual clustering requirements. Therefore, we choose Ward's method for the rest of the cluster analysis.
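To make the comparison of linkage methods concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the article's own code: the built-in `USArrests` data set stands in for the consumption table, and the four methods listed are an assumed set, since the article does not name all four. The names `a_scaled`, `fits`, and `sizes` are hypothetical.

```r
# Stand-in data (assumption): the article's consumption table is not
# reproduced here, so the built-in USArrests data set is used instead.
a <- USArrests
a_scaled <- scale(a)                       # standardize each variable
d <- dist(a_scaled, method = "euclidean")  # pairwise Euclidean distances

# Assumed set of four linkage methods; the article does not list them.
methods <- c("single", "complete", "average", "ward.D2")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Cluster sizes for a three-cluster cut under each method
sizes <- lapply(fits, function(h) table(cutree(h, k = 3)))
print(sizes)
```

Comparing the printed cluster sizes is one simple way to see whether a method produces balanced groups or isolates a few observations in tiny clusters.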
Select the number of clusters
We can choose to divide the 31 regions into 3 categories.
This gives the following clustering results.
# Cut the dendrogram into three clusters
clusters4 <- cutree(hc4, k=3)
# Median of each variable within each cluster (original scale)
aggregate(a, by=list(cluster=clusters4), median)
#Get the median of each category after standardization
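The code for the standardized medians is not shown in the text. A minimal sketch of how they could be computed follows. It assumes that `a` is a numeric data frame and that `hc4` was fitted with Ward's method on standardized data; here `USArrests` is only a stand-in for the article's table, and `a_scaled` and `scaled_medians` are hypothetical names.

```r
# Stand-in data (assumption): USArrests replaces the article's table `a`.
a <- USArrests
a_scaled <- scale(a)  # center and scale each variable

# Assumed to mirror the article's hc4: Ward linkage on Euclidean distances
hc4 <- hclust(dist(a_scaled), method = "ward.D2")
clusters4 <- cutree(hc4, k = 3)

# Median of each standardized variable within each cluster
scaled_medians <- aggregate(as.data.frame(a_scaled),
                            by = list(cluster = clusters4), median)
print(scaled_medians)
```

Because the standardized values share a common scale, the medians can be compared across variables, which makes it easier to see which indicators distinguish one cluster from another.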
We can display the classified regions on a map to obtain a more intuitive result.
In 1999, Beijing, Zhejiang, Shanghai, and Guangdong, which fall into the first category, were the regions with the most developed economies and the highest urban consumption levels in China. Regions in the second category, such as Tianjin, Jiangsu, and Chongqing, had broadly medium levels of economic development and urban consumption. Regions in the third category, such as Shanxi and Gansu, were largely underdeveloped, and their urban consumption levels were also low.
R is well suited to classifying data and can generate concrete groupings. When Power BI alone cannot classify data instances and analyze their relationships, try creating R visuals. Combining R with Power BI can work surprisingly well.