This blog post on the K-Means algorithm is part of the article series Understanding AI Algorithms. K-Means is a clustering algorithm.
K-Means is an algorithm that segments data into clusters to study similarities. This includes information on customer behavior, which can be used for targeted marketing.
The system looks at similarities between observations (for example, customers) and establishes a centroid, which is the center of a cluster. The algorithm then determines the similarities between for example customers by assigning clusters belonging to each observation to the nearest centroid.
An appropriate measurement (defined by the data scientist) is used to determine the distance between observations and centroids. Don’t worry if that doesn’t make sense—we’ll get there.
Let’s go back to the outdoor concert mentioned in an earlier article in this blog post series.
This time, instead of trying to determine a previously known group like we did when classifying a person’s country of origin, we want to know if there actually are groups of similar people in the audience. When we begin, we don’t know what clusters exist, meaning we don’t know what characteristics to use to group people in the crowd.
However, perhaps we find that close to the stage there is a group of people (cluster 1), by the side is another group (cluster 2), and in the middle is a third group (cluster 3).
Once we recognize that these clusters exist, we can then find out what variable keeps them together. Perhaps cluster 1 is a group of friends, cluster 2 is a school class, and cluster 3 is a mix of people who bought their tickets at the same time and ended up standing together.
To train a model, the algorithm first needs to know how many clusters to segment the data into. Choosing the right number of clusters can be tricky and requires exploration and testing.
When the number of clusters is set, K-means starts with randomly assign initial centroids. Remember that we began by not knowing where the clusters are. To start, we randomly set one centroid (the arbitrary center of a cluster) to be in the back, one in the front, and one on the right side of the concert.
From this, the K-means algorithm looks around the centroids to find the points nearby. It then updates where the centroids are located to try to find the middle of the groups that seem to exist. In the example of our second cluster, the school class, the centroid would try to find the middle of that group of people.
Eventually, it finds the place in the crowd that is closest to all members of that cluster without taking in other people in the crowd. After all the adjustments are made, the algorithm sees that there can be no more refinement and stops moving.
This is quite useful in marketing because it allows us to segment large groups of customers and study them based on their behavior or other shared attributes. It discovers commonalities that may not have been apparent before, allowing for more effective targeting and outreach.
As with all approaches, there are advantages and disadvantages here.
On one hand, K-means is easy to implement, meaning it can be experimented with and analyzed without a huge cost or investment of time. It has also been proven quite effective despite the risk that poorly chosen centroids can lead the model to become stuck in false clusters, meaning it thinks it’s found a cluster that is really two different clusters, for example.
It is also useful because the algorithm can handle a huge amount of data. As we discussed earlier, the more data that can be used, the clearer and more accurate the results will be. Using an algorithm that can handle that scale is important.
On the other hand, there are problems with K-means that can make it less useful in some cases. For example, it performs poorly when clusters of data are not round (when data points are scattered, for example if our school class doesn’t stand together in a circle at the concert), or when clusters have different densities, meaning when observations in some clusters are farther apart.
It can also be difficult to determine how many clusters of data to use to create your desired result.
Issues with the shape and distribution of clusters can be remedied in some cases by using the next in our list of clustering algorithms, called DBSCAN.
If you want to read all the related articles on the topic of AI algorithms, here is the list of all blog posts in this article series: