Check your eligibility for this 50% exam voucher offer and join us for free live learning sessions to get prepared for Exam DP-700.
Get StartedJoin us at the 2025 Microsoft Fabric Community Conference. March 31 - April 2, Las Vegas, Nevada. Use code FABINSIDER for $400 discount. Register now
Definition:
Cluster analysis is a data reduction technique that aims to reveal a subset of observations in a data set. An important use of cluster analysis is to divide the target groups with multiple indicators. Such kind of classification of target groups is the basis and core of refined management and personalized operation. Only when the correct classification is carried out can the individual be effectively differentiated including refined operations, services and product support.
Scenario:
Sometimes we will encounter requirements not only to aggregate and analyze data, but also to divide data groups into more meaningful categories based on given indicators so that the analysis can be refined. Now you can try the hierarchical cluster analysis method introduced in this article.
Sample data:
The following table lists the eight main variables of the average annual consumption expenditure per capita of urban households in 31 provinces, municipalities directly under the central government and autonomous regions across the country in 1999.
Preparation:
We can download and install R for free from the Microsoft R Application Network or the CRAN Repository. (Latest version: R-4.0.3 for Windows (32/64 bit))
For the details, please refer to this document: Create Power BI visuals using R.
Operations:
# display the dataset
library(gridExtra) # gridExtra package provides a number of user-level functions to work with "grid" graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
grid.table(dataset) # display dataset as a table
Note: Duplicated rows will be removed from the data. If don’t want this, you can add a column with no duplicate values, such as an Index column.
#Reimport the data and set the value of the first column as a row variable
a<-read.table("your file address",sep=",",header=T,row.names=1)
#Standardize variables
a.scaled <- scale(a)
#Calculate instance distances
d <- dist(a.scaled)
hc1<-hclust(d, "complete") #Complete Linkage
hc2<-hclust(d, "average") #Average Linkage
hc3<-hclust(d, "centroid") #Centroid
hc4<-hclust(d, "ward.D2") #Ward
par(mfrow=c(2,2)) #Form a 2x2 picture matrix
plot(hc1, hang=-1, cex=.8, main="Average Linkage Clustering") #Complete Linkage
plot(hc2, hang=-1, cex=.8, main="Average Linkage Clustering") #Average Linkage
plot(hc3, hang=-1, cex=.8, main="Average Linkage Clustering") #Centroid
plot(hc4, hang=-1, cex=.8, main="Average Linkage Clustering") #Ward
The Ward method is more sensitive to outliers and tends to aggregate classes with a small number of observations, and it is easier to generate classes with roughly the same number of observations. From the above four results, the Ward method better meets the actual clustering requirements. Therefore, we choose Ward method for the next cluster analysis.
We can choose to divide the 31 regions into 3 categories.
#Divide the tree diagram into three categories
clusters4 <- cutree(hc4, k=3)
#Get the median of each category
aggregate(a, by=list(cluster=clusters4), median)
#Get the median of each category after standardization
aggregate(as.data.frame(a.scaled), by=list(cluster=clusters4),
median)
#Get the dendrogram
plot(hc4, hang=-1, cex=.8,
main="Average Linkage Clustering\n3 Cluster Solution")
#Overlay three types of solutions
rect.hclust(hc4, k=3)
We can display the classified areas on the map to obtain a more intuitive result.
Obviously, in 1999, Beijing, Zhejiang, Shanghai, and Guangdong, which belonged to Category I, were the regions with the most developed economy and the highest consumption level of urban residents in China. Areas belonging to the second category, such as Tianjin, Jiangsu, and Chongqing, were basically areas with a medium level of economic development and urban residents' consumption level in China. Areas belonging to the third category, such as Shanxi and Gansu, were basically underdeveloped areas in China, and the consumption level of urban residents was also low.
Summary:
R has advantages in analyzing data categories and can generate specific classifications. When we are unable to classify data instances and analyze the relationship in Power BI, please try to create R visuals. The cooperation effect between R and Power BI will be unexpectedly great.
Author: Yuna
Reviewers: Zoey
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.