This blog post on the K-nearest neighbor algorithm is part of the blog post series Understanding AI Algorithms. K-nearest neighbor (K-NN) is a classification algorithm that is easy to understand and work with.
Because it is simple, it is a good way to grasp the basics of machine learning and become familiar with the fundamentals of data handling and algorithm implementation. Like the other examples we’ve seen, this algorithm also comes with some downsides, and K-NN is not recommended for large data sets or complex problems.
Even so, it can be appropriate for detecting outliers in data (anomaly detection), recommending content to system users (recommendation systems), and helping search engines understand an intended search (semantic search).
K-nearest neighbor works by finding and analyzing the observations nearest to the new data point you wish to classify.
The K is simply the number of nearest observations (neighbors) the data scientist chooses for the algorithm to consider. For example, let’s say you’re at an outdoor concert. If a data scientist wants to predict which country you are from, K-NN could be used to help narrow that down.
Let’s say we set the K value to five. This means the system (or data scientist) will check the five people nearest to you to determine their country of origin. Let’s say that four of the people closest to you are from Sweden and one is from Norway. From this knowledge, the algorithm would assume that you are from Sweden because most of the people around you are too.
Consider the example in the figure above, where you are the black dot. Since K is set to five, we check the five nearest observations (the square and triangular shapes in the circle). Since the majority of the five observations are triangles (from Sweden) and only one is a square (from Norway), you will be classified as a triangle (Swedish).
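To make the idea concrete, here is a minimal sketch in Python: find the K closest observations with a plain Euclidean distance and take a majority vote. The function name, the concert positions, and the labels are all hypothetical and just for illustration; they are not taken from any particular library.

```python
from collections import Counter
import math

def knn_classify(new_point, observations, labels, k=5):
    """Classify new_point by a majority vote among its k nearest observations."""
    # Compute the Euclidean distance from the new point to every known observation.
    distances = [
        (math.dist(new_point, point), label)
        for point, label in zip(observations, labels)
    ]
    # Keep only the k closest observations.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical concert positions (x, y) and countries of origin.
positions = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.0), (1.1, 1.4), (4.0, 4.2), (3.8, 4.5)]
countries = ["Sweden", "Sweden", "Sweden", "Sweden", "Norway", "Norway"]

print(knn_classify((1.2, 1.1), positions, countries, k=5))  # -> "Sweden"
```

With K set to five, four of the five nearest points are labeled Sweden and one Norway, so the vote comes out "Sweden", just as in the figure.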
K-nearest neighbor is a simple algorithm that is easy to set up.
It is also flexible, because both the number of neighbors K and the distance measure can be chosen to suit the data being analyzed.
However, it can be difficult to figure out which variables (attributes) to use for classification. If irrelevant variables are included, the calculated distance can be misleading and the model will perform poorly. This error can be reduced by weighting the variables, in other words giving the aspects of the problem you judge to be more important a larger influence on the distance than the less important ones.
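As a rough sketch of what such weighting could look like, the distance below scales each variable by an importance weight before comparing points. The variables and weights are invented for illustration: the second variable (a shirt colour code) is presumably irrelevant to the country question, so it is nearly ignored.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each variable is scaled by an importance weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical observations: (distance to the stage in metres, shirt colour code).
# Shirt colour is likely irrelevant for guessing a country, so it gets a tiny weight.
weights = [1.0, 0.01]

print(weighted_distance((2.0, 7.0), (2.5, 1.0), weights))  # ~0.78, driven by the relevant variable
print(math.dist((2.0, 7.0), (2.5, 1.0)))                   # ~6.02, dominated by the irrelevant one
```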
K-NN also requires a lot of memory, because the algorithm keeps every historical data point in order to predict a new observation.
Rather than summarizing the data into a compact model, it compares each new observation against the stored ones and classifies it with the most similar group. Because of this, the calculations can take a long time to produce a prediction.
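The sketch below, continuing the earlier hypothetical example, shows why: "training" only stores the data, and all the distance calculations happen at prediction time, so both memory use and prediction time grow with the size of the historical data set. Real libraries typically add indexing structures to speed up the neighbor search.

```python
from collections import Counter
import math

class LazyKNN:
    """Illustration of why K-NN is memory hungry: 'training' only stores the data."""

    def __init__(self, k=5):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No model is built here; every historical observation is kept in memory.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, new_point):
        # All the work happens now: compare the new point against every stored one.
        nearest = sorted(
            (math.dist(new_point, p), label)
            for p, label in zip(self.points, self.labels)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```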
If you want to read all the related articles on the topic of AI algorithms, here is the list of all blog posts in this article series: