Identifying the Number of Clusters in the K-means Algorithm in Power BI: Part 7

Posted by on May 17, 2017 in Uncategorized | 2 Comments


I explained the main concept behind the clustering algorithm in Part 5, and how to do cluster analysis in Power BI in Part 6.
In this post, I will explain how to identify the best number of clusters for cluster analysis by looking at the “elbow chart”.

K-means clusters the data into k clusters. We need some way to check whether we are using the right number of clusters.

The elbow method is one way to validate the number of clusters. The idea of the elbow method is to run k-means clustering on the dataset for a range of k values and compare the results.

The main concept is to minimize the “sum of squared errors (SSE)”, that is, the sum of the squared distances between each object and the mean of the cluster it belongs to. We try values of k from 1 upwards (at most the number of observations) and check the SSE for each one.
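As a minimal sketch of what the SSE measures, the R snippet below computes it by hand and compares it with the value that kmeans() reports as tot.withinss. The toy data frame is made up for illustration and is not the Fitbit data used in this series.

```r
# Hypothetical toy data (assumption): six observations in two obvious groups.
toy <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                  y = c(1, 0.9, 1.1, 5, 5.1, 4.9))

set.seed(42)  # k-means starts from random centers, so fix the seed
fit <- kmeans(toy, centers = 2, nstart = 10)

# SSE by hand: squared distance from each row to the mean of its own cluster.
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
sse_by_hand <- sum((as.matrix(toy) - assigned_centers)^2)

# kmeans() stores the same quantity in tot.withinss.
print(c(by_hand = sse_by_hand, from_kmeans = fit$tot.withinss))
```

The two numbers should agree, which shows that the SSE is simply the total squared distance of the points from their cluster means.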

Let’s have a look at an “elbow chart”.

[Figure: elbow chart, with the number of clusters on the X axis and the SSE on the Y axis]

As you can see in the picture above, the Y axis shows the SSE, which measures the distance of the objects from their cluster mean. A smaller SSE means that we have better clusters (see Part 5).

As the number of clusters on the X axis increases, the SSE becomes smaller. But we want a small number of clusters together with a small SSE, so in the example above we choose the elbow of the chart: the point after which adding more clusters reduces the SSE only slightly.
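One rough way to see the elbow in numbers rather than on a chart is to look at how much the SSE drops each time a cluster is added. The sketch below is only an illustration under assumptions: it uses the four numeric columns of R's built-in iris data as a stand-in for the real dataset, and the range of 1 to 8 clusters is arbitrary.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris set.
dataset <- iris[, 1:4]

set.seed(42)
ks <- 1:8
# SSE (tot.withinss) for each candidate number of clusters.
sse <- sapply(ks, function(k) kmeans(dataset, centers = k, nstart = 20)$tot.withinss)

# Reduction in SSE gained by going from k-1 to k clusters.
sse_drop <- -diff(sse)
print(data.frame(k = ks[-1], sse = round(sse[-1], 1), drop = round(sse_drop, 1)))
```

The elbow is around the k where the drop column stops being large. This is a judgment call, the same as reading the chart by eye.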

So, back to the example from Part 6: I am going to show how to create an elbow chart in Power BI using R code.

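# dataset is the data frame that Power BI passes to the R visual
# SSE for k = 1: total sum of squares of the four columns
wss <- (nrow(dataset[, 1:4]) - 1) * sum(apply(dataset[, 1:4], 2, var))
# SSE for k = 2 to 15
for (i in 2:15) wss[i] <- sum(kmeans(dataset[, 1:4], centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")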

I write this code inside the R script editor of the Power BI R visual.

[Screenshot: Power BI R visual]

According to the explanation above, for clustering the Fitbit data we need 3 or 4 clusters, which gives both a low SSE and a small number of clusters. By applying this number, we should get a better clustering.
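Once a value of k has been read off the elbow chart, it can be passed back to kmeans(). The sketch below is illustrative only: it again uses the built-in iris columns as a stand-in for the Fitbit data (an assumption), and k = 3 is taken from the range suggested above.

```r
# Stand-in for the Fitbit data (assumption): four numeric columns of iris.
dataset <- iris[, 1:4]

set.seed(42)
k <- 3  # value read off the elbow chart
fit <- kmeans(dataset, centers = k, nstart = 20)

# Attach the cluster label to each row and summarise the result.
dataset$cluster <- factor(fit$cluster)
print(table(dataset$cluster))  # size of each cluster
print(fit$tot.withinss)        # SSE for the chosen k
```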

You can download the Power BI file for the cluster analysis and evaluation below.

Download Demo File



[1]https://stats.stackexchange.com/questions/147741/k-means-clustering-why-sum-of-squared-errors-why-k-medoids-not

 

Leila Etaati

Dr. Leila Etaati is a Principal Data Scientist, BI Consultant, and Speaker. She has over 10 years’ experience working with databases and software systems. She was involved in many large-scale projects for large companies. Leila has a PhD from the Information Systems department, University of Auckland, and an MS and BS in computer science. Leila is a Microsoft Data Platform MVP.

