on 04 April
Clustering algorithms are a powerful technique for machine learning on unlabeled data. Two of the most common clustering algorithms are hierarchical clustering and K-Means clustering. Both can be very effective when applied to the right machine learning problems.
Both k-means and hierarchical clustering have been applied to different scenarios to help gain new insights into the problem. Before diving into the innovative uses of clustering algorithms, I will first share an overview of the two algorithms.
What is unsupervised learning?
Before we get started, let me first introduce the concept of unsupervised learning. Unsupervised learning is where you train a machine learning algorithm on data without giving it the answer to the problem. There are no labels, so the algorithm has to find structure in the data on its own.
1) K-means clustering algorithm
The K-Means clustering algorithm is an iterative process that tries to minimize the distance between each data point and the centroid (the average data point) of the cluster it is assigned to.
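To make the iteration concrete, here is a minimal sketch of K-Means in plain Python. It assumes numeric points stored as tuples, Euclidean distance, and starting centroids picked at random from the data; the function name and the sample points are made up for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments
        # Update step: move each centroid to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


# Two loose groups of illustrative 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which group is called 0 and which is called 1 depends on the random start.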
2) Hierarchical clustering
Hierarchical clustering algorithms seek to create a hierarchy of clustered data points.
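In its common bottom-up (agglomerative) form, every data point starts as its own cluster. The algorithm then reduces the number of clusters by repeatedly merging the two that are closest to one another, using a distance measurement such as Euclidean distance for numeric data or Hamming distance for text.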
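The merging idea can be sketched in a few lines of Python. This version assumes single linkage (the distance between two clusters is the distance between their closest members) and stops at a chosen number of clusters; both choices, and the sample data, are assumptions for illustration rather than the only way to do it.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def agglomerate(items, target_clusters, distance):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[item] for item in items]  # every item starts alone
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters


# Euclidean distance for numeric data, Hamming distance for strings.
numeric = agglomerate([(0, 0), (0, 1), (5, 5), (6, 5)], 2, math.dist)
text = agglomerate(["karolin", "kathrin", "kerstin", "1011101", "1001001"], 2, hamming)
print(numeric)
print(text)
```

Recording the order of the merges, rather than stopping at a fixed count, is what gives you the full hierarchy.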
Here are 7 examples of clustering algorithms in action.
1. Identifying Fake News
Fake news is not a new phenomenon, but it is one that is becoming prolific.
What the problem is: Fake news is being created and spread at a rapid rate due to technology innovations such as social media. The issue gained attention recently during the 2016 US presidential campaign. During this campaign, the term Fake News was referenced an unprecedented number of times.
How clustering works: A paper recently published by two computer science students at the University of California, Riverside, uses clustering algorithms to identify fake news based on its content.
The algorithm takes in the content of the articles (the corpus), examines the words used, and then clusters them. These clusters help the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. When an article contains a high percentage of these terms, the probability that it is fake news is higher.
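As a rough illustration of the "percentage of specific terms" idea, the sketch below scores an article by the share of its words that appear in a flagged list. The term list and the sample articles are hypothetical and are not taken from the paper; a score like this would be one input feature to a clustering step, not a fake news detector on its own.

```python
import re
from collections import Counter

# Hypothetical term list for illustration only; not taken from the paper.
SENSATIONAL_TERMS = {"shocking", "unbelievable", "secret", "exposed", "miracle"}


def sensational_share(text, terms=SENSATIONAL_TERMS):
    """Fraction of the words in text that belong to the flagged term set."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    flagged = sum(counts[t] for t in terms)
    return flagged / len(words)


# Made-up example articles.
articles = {
    "a": "Shocking secret exposed: the miracle cure doctors hate",
    "b": "The city council approved the annual transport budget on Tuesday",
}
for name, body in articles.items():
    print(name, round(sensational_share(body), 2))
```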
2. Spam filter
You know the junk folder in your email inbox? It is the place where emails end up once the algorithm has identified them as spam.
Many machine learning courses, such as Andrew Ng’s famed Coursera course, use the spam filter as a teaching example.
What the problem is: Spam emails are at best an annoying part of modern day marketing techniques, and at worst, an example of people phishing for your personal data. To keep these emails out of your main inbox, email companies use algorithms. The purpose of these algorithms is to correctly decide whether or not an email is spam.
How clustering works: K-Means clustering techniques have proven to be an effective way of identifying spam. The way that it works is by looking at the different sections of the email (header, sender, and content). The data is then grouped together.
These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%. This is excellent news for people who want to be sure they’re not missing out on their favorite newsletters and offers.
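One simple way to turn groups into spam and not-spam decisions is to let the few emails you already have labels for vote on their cluster. The sketch below assumes the cluster IDs have already been produced by a clustering step such as K-Means; the function name, the labels, and the data are invented for illustration and do not describe how any particular email provider works.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, known_labels):
    """Give each cluster the most common label among its labeled members.

    cluster_ids: cluster index for each email (e.g. from K-Means).
    known_labels: label for each email, or None where it is unknown.
    """
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, known_labels):
        if label is not None:
            votes[cid][label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


# Seven emails in two clusters; only some have a known label.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
known = ["spam", None, "spam", "ham", None, "ham", "spam"]
cluster_label = label_clusters(cluster_ids, known)
predicted = [cluster_label.get(cid) for cid in cluster_ids]
print(predicted)
```

Note that the one labeled spam email in cluster 1 is outvoted, which is exactly the kind of mistake you would measure when checking the accuracy of the filter.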
3. Marketing and Sales
Personalization and targeting in marketing is big business.
This is achieved by looking at specific characteristics of a person and sharing campaigns with them that have been successful with other similar people.
What the problem is: If you are a business trying to get the best return on your marketing investment, it is crucial that you target people in the right way. If you get it wrong, you risk not making any sales, or worse, damaging your customers' trust.
How clustering works: Clustering algorithms are able to group together people with similar traits and likelihood to purchase. Once you have the groups, you can run tests on each group with different marketing copy that will help you better target your messaging to them in the future.
4. Classifying network traffic
Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.
What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important you know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on areas driving growth. However, it is hard to know which is which when it comes to classifying the traffic.
How clustering works: K-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. The process is faster and more accurate than the previous Autoclass method. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively.
5. Identifying fraudulent or criminal activity
In this scenario, we are going to focus on fraudulent taxi driver behavior. However, the technique has been used in multiple scenarios.
What the problem is: You need to look into fraudulent driving activity. The challenge is how to tell which activity is genuine and which is fraudulent.
How clustering works: By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you are then able to classify them into those that are genuine and those that are fraudulent.
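Before any grouping can happen, a raw GPS log has to be turned into numbers that describe each trip. The sketch below shows one possible pair of features: the distance covered according to the GPS points, and the ratio of the metered distance to that GPS distance. These features and all names are assumptions chosen for illustration; the source does not say which characteristics were actually used.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def trip_features(gps_points, metered_km):
    """Return (gps_km, metered_km / gps_km) for one trip (hypothetical features)."""
    gps_km = sum(haversine_km(p, q) for p, q in zip(gps_points, gps_points[1:]))
    ratio = metered_km / gps_km if gps_km else float("inf")
    return gps_km, ratio


# Made-up trip: three GPS fixes along the equator and a metered distance.
gps_km, ratio = trip_features([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)], metered_km=444.78)
print(round(gps_km, 1), round(ratio, 2))
```

Trips whose metered distance is far larger than the GPS distance would sit apart from ordinary trips once features like these are clustered.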
6. Document analysis
There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organize the documents quickly and efficiently.
What the problem is: Imagine you are limited in time and need to organize information held in documents quickly. To be able to complete this task you need to: understand the theme of the text, compare it with other documents and classify it.
How clustering works: Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the paragraph.
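To group documents, a clustering algorithm first needs a way to say how far apart two texts are. A common choice is cosine distance between word counts, sketched below. This is an assumption for illustration; the source does not say which distance measure was used, and the sample documents are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


docs = [
    "invoice payment due",
    "payment invoice overdue",
    "holiday rota schedule",
]
bags = [bag_of_words(d) for d in docs]
print(round(cosine_distance(bags[0], bags[1]), 2))  # share two of three words
print(round(cosine_distance(bags[0], bags[2]), 2))  # share no words
```

A hierarchical algorithm would merge the first two documents long before either is merged with the third.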
7. Fantasy football
So far we have looked at different business problems and how clustering algorithms have been applied to solve them.
But now for the critical issues – fantasy football!
What the problem is: Who should you have in your team? Which players are going to perform best for your team and allow you to beat the competition? The challenge at the start of the season is that there is very little, if any, data available to help you identify the winning players.
How clustering works: When there is little performance data available to train your model on, unsupervised learning has an advantage. In this type of machine learning problem, you can find similar players using some of their characteristics. This has been done using K-Means clustering. Ultimately this means you can build a better team more quickly at the start of the year, giving you an edge over the competition.
How will you use clustering algorithms?
So there you have it, those were 7 innovative uses of clustering algorithms. As you can see, while the technique remains reasonably constant, you can apply it to many different scenarios.
Looking at the characteristics of different groups of data can help you make better predictions of behavior. In this scenario, the real value of the algorithms is to help you create the best possible groups of data.
Once you have a solid foundation of grouped data to work with, the opportunities become infinite.