K Nearest Neighbour (KNN) is one of those algorithms that is very easy to understand, and it achieves a good level of accuracy in practice. In Part One of this series, I explained the KNN concepts. In Part Two, I explained how to write the R code for KNN and how to evaluate the KNN model. In this post, I want to show how to do KNN in Power BI.
If you do not have Power BI Desktop, install it from https://powerbi.microsoft.com/en-us/
In Power BI, click on “Get Data” to import the dataset into Power BI. We use the same dataset as in Part Two. The dataset contains patient data such as the diagnosis and laboratory results (31 columns).
You will see the data (number 3) on the right side of Power BI.
We want to clean the data first, so we click on “Edit Queries” in Power BI to do some data cleaning and also to apply the R scripts that create the KNN model (numbers 1 and 2).
Clicking on “Edit Queries” opens the “Query Editor” window.
First of all, we want to remove the “ID” column. The ID attribute has no impact on the prediction results. Hence, we right-click on the “ID” column in Power BI (numbers 1 and 2) and remove it from the dataset.
Another data cleaning step is replacing the value “B” with “Benign” and “M” with “Malignant” in the diagnosis column. To do that, we right-click on the diagnosis column (number 1), then click on “Replace Values” (number 2) in the Transform tab. In the Replace Values dialog, type “B” for “Value To Find” (number 3) and “Benign” for “Replace With”. Do the same for “M” and “Malignant”.
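For readers who prefer to do this cleaning inside an R script rather than through the menus, the two steps above can be sketched as follows. This is only an illustration: the three-row data frame and the column name “radius_mean” are made-up stand-ins, not the real patient data.

```r
# Tiny stand-in for the imported table (hypothetical values and column names).
dataset <- data.frame(
  id = c(1, 2, 3),
  diagnosis = c("B", "M", "B"),
  radius_mean = c(12.1, 20.5, 11.3),
  stringsAsFactors = FALSE
)

# Drop the ID column, which carries no information for the prediction.
dataset <- dataset[names(dataset) != "id"]

# Recode "B" and "M" into readable labels.
dataset$diagnosis <- factor(dataset$diagnosis,
                            levels = c("B", "M"),
                            labels = c("Benign", "Malignant"))
```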
The result of the applied query will look like the picture below:
Another data cleaning step is data normalization, which was explained in Part Two. We want to convert all the measurement values to the same scale. Hence, we click on the “Transform” tab and select the dataset (any numeric column). In this step we use an R script to normalize the data, so we click on the R script option to perform the normalization.
We write the same code we had in Part Two. The whole data (wis_bc_data) is held in “dataset”. To do the normalization, we first write a function (number 1), which is stored in the “normalize” variable. Then we apply the normalize function to the dataset (number 3). We want to apply the function to the numeric data, not to the text (the diagnosis column), hence we refer to dataset[2:31], which means the function is applied to columns 2 to 31. The result is a data frame that is stored in the “wbcd_n” variable.
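The normalization step can be sketched roughly as below. I am assuming min-max normalization here (rescaling each column to the range 0 to 1); the exact code is the one from Part Two. The “dataset” created at the top is only a random stand-in so the snippet runs outside Power BI; inside Power BI, “dataset” is supplied by the query.

```r
# Stand-in for the data frame that Power BI passes to the script as `dataset`.
# The values are random placeholders, not the real patient data.
set.seed(1)
dataset <- data.frame(
  diagnosis = sample(c("Benign", "Malignant"), 569, replace = TRUE),
  matrix(runif(569 * 30, min = 0, max = 100), nrow = 569)
)

# Min-max normalization (assumed): rescale a numeric vector to [0, 1].
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply the function to columns 2 to 31 (the numeric measurements);
# column 1 (diagnosis) is text and is left out.
wbcd_n <- as.data.frame(lapply(dataset[2:31], normalize))
```

After this step every column of “wbcd_n” has a minimum of 0 and a maximum of 1, so no single measurement dominates the distance calculation.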
In the next step, we create the test and training datasets (see Part Two). In this example, we set aside rows 1 to 469 for training and creating the model, and rows 470 to 569 for testing the model. Finally, the data is ready, and we are able to train the model and run the KNN algorithm.
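A minimal sketch of this split is shown below. The variable names “wbcd_train”, “wbcd_test” and the two label vectors are my own choice for illustration, and the data is again a random stand-in rather than the real patient table.

```r
# Random stand-in data: column 1 = diagnosis, columns 2 to 31 = measurements.
set.seed(1)
dataset <- data.frame(
  diagnosis = sample(c("Benign", "Malignant"), 569, replace = TRUE),
  matrix(runif(569 * 30), nrow = 569)
)
wbcd_n <- dataset[2:31]  # placeholder for the normalized measurements

# Rows 1 to 469 for training, rows 470 to 569 for testing.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]

# The matching diagnosis labels come from column 1 of the dataset.
wbcd_train_labels <- dataset[1:469, 1]
wbcd_test_labels  <- dataset[470:569, 1]
```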
We have already installed the package “class” in RStudio. Now we are able to call the knn function to predict the patient diagnosis. The knn function accepts the training dataset as its first argument and the test dataset as its second argument; it also needs the diagnosis labels of the training data. As discussed in Part One, a common way to choose K (the number of nearest neighbours) is to take the square root of the number of observations. Here, for 469 observations, the square root is about 21.66, so we use K = 21. The variable “wbcd_test_pred” holds the result of the KNN prediction. However, we want to have the result beside the real data, so we store the test dataset in the “output” variable, then add a separate column to store the prediction result (number 3). Then click on “OK” to apply the R script to the data.
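Put together, the prediction step might look like the sketch below. The names “wbcd_test_pred” and “output” come from the text above; the other variable names, the column name “prediction”, and the random stand-in data are assumptions made for illustration only.

```r
library(class)  # provides knn()

# Random stand-in data: column 1 = diagnosis, columns 2 to 31 = measurements.
set.seed(1)
dataset <- data.frame(
  diagnosis = sample(c("Benign", "Malignant"), 569, replace = TRUE),
  matrix(runif(569 * 30), nrow = 569)
)

# Normalize (min-max, assumed) and split as in the previous steps.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
wbcd_n <- as.data.frame(lapply(dataset[2:31], normalize))
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]
wbcd_train_labels <- factor(dataset[1:469, 1])

# Square-root rule of thumb: sqrt(469) is about 21.66, rounded down to 21.
k <- floor(sqrt(nrow(wbcd_train)))

# knn(train, test, cl, k): training data, test data, training labels, K.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = k)

# Keep the prediction beside the test data in a data frame called `output`.
output <- wbcd_test
output$prediction <- wbcd_test_pred
```

Because the stand-in data is random, the predictions themselves are meaningless here; the point is only the shape of the call and of the resulting “output” data frame.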
We will see the output below:
For each output (data frame in the R script), we will have a table value. We store the final result in “output” in our R script, so we click on the Value (Table) next to the “output” name (see the picture above). The result will look like the picture below. In the output, we have the test dataset, and the last column holds the prediction results. If you look at the bottom left, you will see 100 rows, which is the number of test cases we have.
After the R transformation, just click on “Close & Apply” to see the result in the visualization.
However, sometimes you also want to see the patient ID in the result dataset. So what we do is change the R code, as below:
That means we just take the whole dataset from row 470 to 569. This dataset is the original one, which contains the patient ID.
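A rough sketch of that change follows. It assumes the ID column is still present as the first column of “dataset”, so the numeric measurements move to columns 3 to 32; that layout, the column name “prediction”, and the random stand-in data are my assumptions, not something taken from the original script.

```r
library(class)  # provides knn()

# Random stand-in data: column 1 = id, column 2 = diagnosis,
# columns 3 to 32 = measurements (assumed layout).
set.seed(1)
dataset <- data.frame(
  id = seq_len(569),
  diagnosis = sample(c("Benign", "Malignant"), 569, replace = TRUE),
  matrix(runif(569 * 30), nrow = 569)
)

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
wbcd_n <- as.data.frame(lapply(dataset[3:32], normalize))

wbcd_test_pred <- knn(train = wbcd_n[1:469, ], test = wbcd_n[470:569, ],
                      cl = factor(dataset$diagnosis[1:469]), k = 21)

# Take rows 470 to 569 of the ORIGINAL dataset (with id and diagnosis),
# then attach the prediction as an extra column.
output <- dataset[470:569, ]
output$prediction <- wbcd_test_pred
```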
Then just close and apply the query.
In the visualization below, I show the result of the prediction alongside the real data. In the current dataset we have the patient ID, the real data about the doctor’s diagnosis, and the diagnosis predicted by KNN. In the Real Diagnosis filter (picture: number 1) we are able to select the patients that are “Malignant”, and in the second filter we are able to choose the prediction result “Benign”. This shows us the cases where the prediction did not work properly.
We have 2 cases where KNN predicted the wrong diagnosis, as you can see in the picture below.
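The same check that the two filters perform can also be sketched in R. The five rows below are a made-up toy example (they are not the real results), and the column names “id”, “diagnosis” and “prediction” are assumptions about how the “output” table is laid out.

```r
# Toy stand-in for the `output` table (hypothetical rows, not real results).
output <- data.frame(
  id = c(101, 102, 103, 104, 105),
  diagnosis  = c("Malignant", "Benign", "Malignant", "Benign", "Malignant"),
  prediction = c("Malignant", "Benign", "Benign",    "Benign", "Benign")
)

# Same selection as the two filters: really Malignant, predicted Benign.
missed <- output[output$diagnosis == "Malignant" &
                 output$prediction == "Benign", ]

# Cross-tabulate real against predicted diagnosis for a quick overview.
confusion <- table(actual = output$diagnosis, predicted = output$prediction)
print(missed)
print(confusion)
```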
In the next series, I will talk about other algorithms such as Neural Networks (Deep Learning), Time Series, and Decision Trees.
Moreover, there is an upcoming series on Azure ML, which will start soon.
