Prediction via KNN (K Nearest Neighbours) Concepts: Part 1

K Nearest Neighbor (KNN ) is one of those algorithms that are very easy to understand and has a good accuracy in practice. KNN can be used in different fields from health, marketing, finance and so on [1]. KNN is easy to understand and also the code behind it in R also is too easy to write. In this post, I will explain the main concept behind KNN. Then in Part 2 I will show how to write R codes for KNN. Finally in the Part 3 the process of how run KNN in Power BI data will be explained.

To understand the KNN concepts, consider below example:
We are designing a game for children below 6 . First we asked them to close their eyes and then by tasting a fruit, identify is it sour or sweet. based on their answers, we have below diagram

as you can see we have three main groups based on the level of sweetness and sourness. we asked children to put a number of sweetness and sourness for each fruits in 10 scale. so we have below numbers. As you can see Lemon for example, has the high number in Sourness and low number in sweetness. Whist, Watermelon has high number (9) in sweetness and number 1 for sourness. (this is a example maybe the number is not correct, the aim of this example to show the concepts behind the KNN)

Imagine that we have a fruit that is not in above list, we want to identify the nearness of that fruit to others and then identify it is a sweet fruit or sour one. Consider figs as example. to identify it is a sweet or sour fruit, we have some number of its level of sourness and sweetness as below

as you can see for sweetness it is 7 and for sourness it is 3

to find which fruit is near to this one, we should calculate the distance between Figs and other fruits.

from mathematics perspective, to find out distance between two points, we use the Euclidean distance formula as below:

For calculating the distance between Figs and Lemon, we first subtract their dimensions (above formula)

distance between Fig and Lemon is 8.2 now we are going to calculate this distance for all other fruits. as you can see in below table, the distance between Cherry and Grapes is so close to Figs (distance 1.41)

hence, Cherry and Grape are closet neighbor to Fig, we call them the first Nearest Neighbor. Watermelon with 2.44 is the Second Nearest Neighbor to Figs. the third nearest neighbor is strawberry and banana.

as you see in this example we calculate 8 nearest neighbor.

8 nearest neighbor for this example is Lemon with 8.4 distance. there is a lot distance between Lemon and Figs, so it is not correct to consider Lemon as nearest Neighbor. to find the best number for k(number of neighbors) we have consider the square root of the number of observations in our example. For instance,we have 10 observations which Square root is 3, so we have 3 nearest neighbors based on distance as first neighbor(Cherry and Grapes), second neighbor(Watermelon) and third is (Banana and Strawberry).

Because all of these are Sweet fruits, we consider Figs as a sweet one.

so in any example we calculate the distance of items to others categories. there other methods for calculating the distance.

KNN, has been used to predict a group for new items. for example:
1. predict that a customer will stay with us or not (new customer belong to with group: stay or leave)

2. image processing, if an uploaded picture of animal is related t birds, cats, and so on.

In the next post I will explain the related R codes for KNN .

[1]https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/

[2].Machine Learning with R,Brett Lantz, Packt Publishing,2015.

Save

Leila Etaati

Trainer, Consultant, Mentor

Leila is the first Microsoft AI MVP in New Zealand and Australia, She has Ph.D. in Information System from the University Of Auckland. She is the Co-director and data scientist in RADACAD Company with more than 100 clients in around the world. She is the co-organizer of Microsoft Business Intelligence and Power BI Use group (meetup) in Auckland with more than 1200 members, She is the co-organizer of three main conferences in Auckland: SQL Saturday Auckland (2015 till now) with more than 400 registrations, Difinity (2017 till now) with more than 200 registrations and Global AI Bootcamp 2018. She is a Data Scientist, BI Consultant, Trainer, and Speaker. She is a well-known International Speakers to many conferences such as Microsoft ignite, SQL pass, Data Platform Summit, SQL Saturday, Power BI world Tour and so forth in Europe, USA, Asia, Australia, and New Zealand. She has over ten years’ experience working with databases and software systems. She was involved in many large-scale projects for big-sized companies. She also AI and Data Platform Microsoft MVP. Leila is an active Technical Microsoft AI blogger for RADACAD.

4 thoughts on “Prediction via KNN (K Nearest Neighbours) Concepts: Part 1”

Thank you for the article.
How does one decide which classifier algorithm to choose? Should one be using KNN? SVM? etc…
Any guidance?
Thank you…
John

Leila Etaati says:

May 2, 2017 at 3:46 pm

Your welcome,
It depends KNN is good for both linear and none linear data but not good for outlier, SVM is good for outlier. Also SVM can be used in linear or non-linear. SVM works differently and it is good and fast solution for many problems.KNN is also very sensitive to bad features (attributes) so feature selection is also important.

Loading...

Reply

What’s the memory & file size impact of R code on a Power BI solution such as the KNN or one that may include data modeling & a large dataset?

Leila Etaati says:

October 11, 2017 at 11:31 am

Hi tony, I will write a blog soon and I will show how if we have different size of data using one algorithm how is the speed of training, hope that one helps to answer, that is a really good question, however I applied R algorithm in other big dataset for aim of textmining and other aim, it was goo, but for KNN I should try

Loading...

Reply

John Petroda says:

April 16, 2017 at 4:39 am

Thank you for the article.
How does one decide which classifier algorithm to choose? Should one be using KNN? SVM? etc…
Any guidance?
Thank you…
John

Loading...

- Leila Etaati says:
  
  May 2, 2017 at 3:46 pm
  
  Your welcome,
  It depends KNN is good for both linear and none linear data but not good for outlier, SVM is good for outlier. Also SVM can be used in linear or non-linear. SVM works differently and it is good and fast solution for many problems.KNN is also very sensitive to bad features (attributes) so feature selection is also important.
  
  Loading...
  
Tony says:

October 8, 2017 at 6:24 pm

What’s the memory & file size impact of R code on a Power BI solution such as the KNN or one that may include data modeling & a large dataset?

Loading...

- Leila Etaati says:
  
  October 11, 2017 at 11:31 am
  
  Hi tony, I will write a blog soon and I will show how if we have different size of data using one algorithm how is the speed of training, hope that one helps to answer, that is a really good question, however I applied R algorithm in other big dataset for aim of textmining and other aim, it was goo, but for KNN I should try
  
  Loading...

4 thoughts on “Prediction via KNN (K Nearest Neighbours) Concepts: Part 1”

Leave a ReplyCancel reply