Prediction via KNN (K Nearest Neighbours) Concepts: Part 1

Posted on Mar 22, 2017 in Analytics, Azure Machine Learning, R


K Nearest Neighbor (KNN) is one of those algorithms that is very easy to understand and still gives good accuracy in practice. It is used in many fields, including health, marketing, and finance [1]. The R code behind it is also easy to write. In this post, I will explain the main concept behind KNN. In Part 2, I will show how to write R code for KNN. Finally, Part 3 will explain how to run KNN on data in Power BI.

To understand the KNN concept, consider the example below.
We are designing a game for children under 6. First we ask them to close their eyes, taste a fruit, and say whether it is sour or sweet. Based on their answers, we get the diagram below.

[Figure: fruits grouped by sweetness and sourness]

As you can see, there are three main groups based on the level of sweetness and sourness. We asked the children to rate the sweetness and sourness of each fruit on a scale of 10, which gives the numbers below. Lemon, for example, has a high number for sourness and a low number for sweetness, whilst Watermelon has a high number (9) for sweetness and 1 for sourness. (This is only an example and the numbers may not be accurate; the aim is to show the concept behind KNN.)

[Table: sweetness and sourness scores for each fruit, on a scale of 10]

 

Imagine we have a fruit that is not in the list above. We want to measure how near it is to the other fruits and then decide whether it is a sweet fruit or a sour one. Consider Figs as an example. To decide whether it is sweet or sour, we use its sourness and sweetness scores, shown below.

[Table: sweetness and sourness scores for Figs]

As you can see, its sweetness is 7 and its sourness is 3.

[Figure: distance between Figs and the other fruits]

To find which fruits are nearest to this one, we calculate the distance between Figs and each of the other fruits.

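From a mathematical perspective, the distance between two points is given by the Euclidean distance formula:

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

Here p and q are two fruits, p_1 and q_1 are their sweetness scores, and p_2 and q_2 are their sourness scores. To calculate the distance between Figs and Lemon, we first subtract their scores in each dimension, then square the differences, add them, and take the square root.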
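As a quick check of the formula, here is a minimal Python sketch (Python is used purely for illustration; it is not the R code from Part 2). The score table is not reproduced here, so the sketch uses the one pair whose scores are both quoted in the text: Figs (sweetness 7, sourness 3) and Watermelon (sweetness 9, sourness 1). The function name `euclidean_distance` is my own choice.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points given as (sweetness, sourness)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

fig = (7, 3)         # sweetness 7, sourness 3 (quoted in the post)
watermelon = (9, 1)  # sweetness 9, sourness 1 (quoted in the post)

d = euclidean_distance(fig, watermelon)
print(round(d, 2))  # sqrt((7-9)^2 + (3-1)^2) = sqrt(8), about 2.83
```

The same function works for any number of dimensions, so it would still apply if a third score were added for each fruit.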

The distance between Figs and Lemon is 8.2. We now calculate this distance for all the other fruits. As the table below shows, Cherry and Grapes are the closest to Figs (distance 1.41).

[Table: distance from Figs to each of the other fruits]

Hence, Cherry and Grapes are the closest neighbors to Figs; we call them the first nearest neighbors. Watermelon is the second nearest neighbor, at a distance of about 2.83 (using the scores of 9 and 1 quoted above). The third nearest neighbors are Strawberry and Banana.

As you can see, in this example we calculated 8 nearest neighbors.

The 8th nearest neighbor in this example is Lemon, at a distance of 8.4. That is a long way from Figs, so it is not correct to treat Lemon as a near neighbor. A common rule of thumb for choosing k (the number of neighbors) is to take the square root of the number of observations. For instance, we have 10 observations, whose square root is about 3.16, so we round down to k = 3. Based on distance, the three nearest neighbors are the first (Cherry and Grapes), the second (Watermelon), and the third (Banana and Strawberry).
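The square-root rule of thumb can be written in a couple of lines. This is only a sketch of that heuristic in Python; the helper name `choose_k` and the decision to round to the nearest whole number (with a minimum of 1) are my assumptions.

```python
import math

def choose_k(n_observations):
    """Rule of thumb: k is roughly the square root of the number of observations."""
    return max(1, round(math.sqrt(n_observations)))

k = choose_k(10)
print(k)  # sqrt(10) is about 3.16, which rounds to 3
```

In practice this value is a starting point; it is common to try a few values of k around it and compare the results.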

Because all of these are sweet fruits, we classify Figs as sweet.
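The whole procedure (measure distances, keep the k nearest, take the majority label) can be sketched in Python as follows. Please note the assumptions: the score table is not reproduced here, so only the Watermelon and Figs scores come from the post, and every other score is a hypothetical stand-in chosen to be roughly in line with the distances described. The sketch also takes the k closest fruits one by one, rather than grouping tied distances into a single "neighbor" as in the walkthrough above. The function name `knn_predict` is my own.

```python
import math
from collections import Counter

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (name, (sweetness, sourness), label)
# Only Watermelon's scores are quoted in the post; the rest are hypothetical.
training = [
    ("Cherry", (8, 4), "sweet"),       # hypothetical scores
    ("Grapes", (6, 2), "sweet"),       # hypothetical scores
    ("Watermelon", (9, 1), "sweet"),   # scores from the post
    ("Strawberry", (10, 4), "sweet"),  # hypothetical scores
    ("Banana", (10, 2), "sweet"),      # hypothetical scores
    ("Lemon", (1, 9), "sour"),         # hypothetical scores
]

def knn_predict(new_point, training, k):
    """Return the majority label among the k nearest fruits, plus their names."""
    by_distance = sorted(training, key=lambda row: euclidean_distance(new_point, row[1]))
    nearest = by_distance[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0], [name for name, _, _ in nearest]

fig = (7, 3)  # sweetness 7, sourness 3 (from the post)
label, neighbours = knn_predict(fig, training, k=3)
print(label, neighbours)
```

With these stand-in scores the three nearest fruits are all sweet, so Figs are labelled sweet, which matches the conclusion above.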

So in any example, we calculate the distance from the new item to the items in the existing categories. There are also other methods for calculating the distance.
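The post does not name the other methods. One widely used alternative is Manhattan distance, which adds up the absolute differences instead of squaring them. A minimal Python sketch, using the Figs and Watermelon scores quoted earlier (the function name is my own):

```python
def manhattan_distance(p, q):
    """Sum of absolute differences between two points, dimension by dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

fig = (7, 3)         # sweetness 7, sourness 3 (from the post)
watermelon = (9, 1)  # sweetness 9, sourness 1 (from the post)

print(manhattan_distance(fig, watermelon))  # |7-9| + |3-1| = 4
```

Different distance measures can rank neighbors differently, so the choice can change which group a new item falls into.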

KNN is used to predict which group a new item belongs to. For example:
1. Predicting whether a customer will stay with us or not (which group does a new customer belong to: stay or leave?).

2. Image processing: deciding whether an uploaded picture of an animal shows a bird, a cat, and so on.

In the next post, I will explain the related R code for KNN.

[1] https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/

[2] Brett Lantz, Machine Learning with R, Packt Publishing, 2015.

Leila Etaati
Dr. Leila Etaati is a Principal Data Scientist, BI Consultant, and Speaker. She has over 10 years' experience working with databases and software systems, and has been involved in many large-scale projects for large companies. Leila holds a PhD from the Information Systems department of the University of Auckland, and an MS and BS in computer science. She is a Microsoft Data Platform MVP.

2 Comments

  • Thank you for the article.
    How does one decide which classifier algorithm to choose? Should one be using KNN? SVM? etc…
    Any guidance?
    Thank you…
    John

    • You're welcome,
      It depends. KNN works for both linear and non-linear data but does not handle outliers well, whereas SVM copes well with outliers. SVM can also be used for linear or non-linear data. SVM works differently, and it is a good, fast solution for many problems. KNN is also very sensitive to bad features (attributes), so feature selection is important.
