Machine learning (commonly called “AI” these days) is making its way into every industry. But how does it work, and what is required to create a machine-learning system that can “AI-power” your business?
In Predictive Analytics and Machine Learning, I presented an introduction to the topic. In this blog post, I explain the main steps required to create a machine-learning system that can be used to optimize the business processes or marketing decisions in your company.
The main steps are:
- Prepare and preprocess data
- Create and train a prediction model
- Test the prediction model accuracy
- Deploy and make predictions
In the following sections, these steps are explained in more detail.
Prepare and Preprocess Data
Finding the right historical data, getting it imported, and adjusting it in a suitable way is often the most time-consuming part of a machine learning project. Massive amounts of data are often used, since a prediction model generally becomes more accurate when it is trained with large data sets.
First, the data must be collected from various sources and imported into the machine learning system. The data may come from different types of structured SQL databases, unstructured NoSQL databases, or even binary files or server logs. The data needs to be merged into the same system for a coherent analysis to be possible.
The second step is to preprocess the data into a more usable form, perhaps by removing incomplete data records or by filling in missing data fields with some sensible value, for example setting such values to zero or the average value of related data points.
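As a minimal sketch of this preprocessing step, the Python snippet below shows both options: dropping incomplete records, or filling a missing field with the average of the known values. The record layout and the field name "temperature" are made up for illustration.

```python
from statistics import mean

# Hypothetical sensor records; None marks a missing reading.
records = [
    {"id": 1, "temperature": 20.0},
    {"id": 2, "temperature": None},
    {"id": 3, "temperature": 24.0},
]

def drop_incomplete(rows, field):
    """Return only the rows that have a value for `field`."""
    return [row for row in rows if row[field] is not None]

def fill_missing_with_mean(rows, field):
    """Return new rows where missing `field` values are replaced by the mean of the known ones."""
    known = [row[field] for row in rows if row[field] is not None]
    if not known:
        raise ValueError(f"no known values for {field!r}")
    average = mean(known)
    return [
        {**row, field: average if row[field] is None else row[field]}
        for row in rows
    ]

complete_only = drop_incomplete(records, "temperature")
cleaned = fill_missing_with_mean(records, "temperature")
print(cleaned)
```

Which option is better depends on the data: dropping records loses information, while filling in values adds data points that were never actually measured.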
Math can also be applied where needed. A good example is converting values to a common unit, such as making sure all temperature measurements are in either Fahrenheit or Celsius, not a mix of the two. A prediction model that depends on temperature readings will likely not be very accurate if those readings mix Celsius and Fahrenheit.
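To illustrate, here is a small sketch that converts mixed temperature readings to Celsius. It assumes each reading arrives together with a unit marker ("C" or "F"), which is an assumption made for this example; real data sources may encode the unit differently.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")

# Hypothetical readings as (value, unit) pairs from different sources.
readings = [(68.0, "F"), (20.0, "C"), (212.0, "F")]
normalized = [round(to_celsius(value, unit), 2) for value, unit in readings]
print(normalized)
```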
Data formatting can also be done, for example to align the different date formats used in various countries.
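Date alignment could look something like the sketch below, which converts a few date notations to the ISO format (YYYY-MM-DD). The list of accepted formats is an assumption for this example. Note that a date such as 03/05/2024 is ambiguous on its own, so you need to know which convention each data source uses.

```python
from datetime import datetime

# Formats assumed for this example: ISO, dotted day-first, and US month-first.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ("2024-03-05", "05.03.2024", "03/05/2024"):
    print(raw, "->", to_iso_date(raw))
```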
There are different types of prediction algorithms available that solve various types of problems. For example, a model can predict binary values (binary classification), category attributes (multi-class classification), numeric values (regression), or group similarity (clustering).
Classification algorithms try to determine the discrete value of a data attribute; for example, whether a person is likely to play golf or not, or what brand of car someone is likely to own. This can be inferred with a certain level of confidence from factors like sex, age, income, education, or address.
A binary classification predicts whether something is true or false, for example whether an email is likely to be spam, or whether a machine needs service. A multi-class classification predicts one of several predefined possibilities, such as whether a person is most likely to own a sports car, a pickup truck, or a family car.
While classification algorithms attempt to find a value from a small set of possible values, regression algorithms try to predict a numeric value in any range, such as the most likely selling price of a home, or the traffic volumes of a tunnel.
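The simplest form of regression is fitting a straight line through historical data points using least squares. The sketch below does this in plain Python; the home sizes and prices are invented, and real projects would normally use a library and many more input factors.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: living area (square meters) and selling price (thousands).
areas = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept  # prediction for a home of 90 square meters
print(round(predicted_price, 1))
```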
Clustering analysis, on the other hand, can be used to group data records that are similar to each other in some manner, while still being distinct from the data records of other groups. A point-of-sale system could use this to segment customers into groups with similar purchase patterns, for example.
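One well-known clustering method is k-means. The sketch below is a deliberately tiny version that works on a single number per customer (a made-up monthly spend) and uses fixed starting centers to keep the result reproducible. It is meant to show the idea, not to replace a library implementation.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny k-means for one-dimensional data with explicit starting centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            # Assign each value to the group with the nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 210, 190, 205]
centers, groups = k_means_1d(spend, centers=[0.0, 100.0])
print(groups)
```

No labels were supplied here: the two customer segments emerge from the data itself.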
There are many machine-learning algorithms available, and they generally fall into two categories: supervised learning algorithms, and unsupervised ones.
Supervised learning algorithms are taught with a set of training data with well-defined expected outcomes. Credit card fraud is a good example—the model is taught with many previous credit card transactions, some of which are known to be fraudulent. Supervised learning algorithms are often used for classification and regression problems.
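To make the idea of labeled training data concrete, here is a sketch of about the simplest supervised classifier there is, a 1-nearest-neighbor lookup. The two features (amount and distance from home) and all the values are invented for this example; real fraud detection uses far more data and more capable algorithms.

```python
import math

# Hypothetical labeled training data: (amount in dollars, distance from home in km) -> known outcome.
training = [
    ((25.0, 2.0), "legitimate"),
    ((40.0, 5.0), "legitimate"),
    ((900.0, 4000.0), "fraud"),
    ((1200.0, 6500.0), "fraud"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, label = min(training, key=lambda item: math.dist(item[0], features))
    return label

print(predict((30.0, 3.0)))
print(predict((1000.0, 5000.0)))
```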
Unsupervised learning algorithms are designed to detect structures in data sets where the desired outcome is not known. In effect, you don’t know what you are looking for, as is the case with automated customer segmentation, for example.
Training the Model
Once the data is imported and preprocessed into a high-quality and usable format, it is used to run different experiments. Typically, a data scientist tries many combinations of machine learning algorithms and data sets to search for the combination that produces the most accurate predictions.
Each iteration is an experiment in which different ways to find the desired pattern are tested. The result of each experiment is a candidate model, which is an algorithm implementation that hopefully will determine accurately if new data matches a certain pattern.
Training the prediction model is the process of finding the best model by iteratively applying different machine learning algorithms to the historical data. Initial training of a machine learning model is really a number of trial-and-error experiments. This is best done by specialists: data scientists or machine learning engineers.
Once the best prediction model has been found, it is generated as software code to implement an effective solution to the problem being solved. The model is directly derived from actual historical data, rather than having data scientists try to invent a solution manually using their own ingenuity.
The whole process may seem a bit confusing by now. It is important to note that the machine learning algorithms are run on historical data, with the purpose of finding a prediction model that can detect valuable patterns in future data (such as detecting whether a credit card transaction is likely to be fraudulent).
The actual prediction model, on the other hand, is the solution to the problem (for example, software code that can determine if a new credit card transaction is likely to be fraudulent). Therefore, the prediction models are integrated into software applications to solve the problem, while the machine learning algorithms are only used during the development of the prediction models.
It is also important to understand that most prediction models don’t provide only an exact answer, such as a plain yes or no to a question. Rather, in addition to the answer, a prediction model typically returns a probability or confidence factor (from 0 to 100%) based on how well the new data matches the pattern detected by the model.
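A common way to produce such a probability is the logistic function, which squeezes any score into the range 0 to 1. In the sketch below the weights are hypothetical numbers chosen for illustration; in a real model they would be learned from the training data.

```python
import math

def fraud_probability(amount, distance_km):
    """Map a weighted score to a probability between 0 and 1 with the logistic function."""
    # Hypothetical weights; a trained model would learn these from historical data.
    score = -4.0 + 0.002 * amount + 0.001 * distance_km
    return 1.0 / (1.0 + math.exp(-score))

probability = fraud_probability(amount=1500.0, distance_km=3000.0)
answer = "fraud" if probability >= 0.5 else "legitimate"
print(f"{answer} ({probability:.0%} probability of fraud)")
```

The application decides what to do with the probability. It might block a transaction only above 90%, for instance, and ask for extra verification in the range between 50% and 90%.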
Testing the Accuracy
During each iteration of creating new prediction model candidates, we must be able to measure if the model works, and how well it works compared to other model candidates. We obviously want a good model that predicts with high accuracy, and out of the top candidates, we will pick the best one for deployment.
But how can we test a model on future data that doesn’t exist yet? The trick is to use only a large share, but not all, of the historical data to develop the model. Use, for example, only 75% of the historical data to develop the model. This is called the training data. Keep the remaining 25% and use it to test the accuracy of the model against known data that the model was not derived from. This is called the testing data.
By testing the model against the remaining real data, it is possible to measure how good the model is at predicting the future. If it doesn’t work well enough, further experiments are needed to find models that predict with higher accuracy.
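The split-and-test procedure can be sketched as follows. The data (machine operating hours and whether service was needed) is invented, and the "training" is a deliberately trivial search for the best cut-off value, so treat this as an illustration of the workflow rather than a realistic model.

```python
import random

def split_data(rows, train_share=0.75, seed=0):
    """Shuffle a copy of the rows and split it into training and testing data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, rows):
    """Share of rows for which the model's prediction equals the known label."""
    correct = sum(1 for value, label in rows if model(value) == label)
    return correct / len(rows)

def train_threshold_model(rows):
    """'Train' by picking the cut-off that classifies the training rows best."""
    cutoffs = sorted({value for value, _ in rows})
    best = max(cutoffs, key=lambda c: accuracy(lambda v: v >= c, rows))
    return lambda v: v >= best

# Hypothetical historical data: (operating hours since last service, needed service?).
history = [(hours, hours >= 60) for hours in range(100)]
training_data, testing_data = split_data(history)

# Two candidate models: the trained one and a naive baseline for comparison.
candidates = {
    "threshold": train_threshold_model(training_data),
    "never-service": lambda v: False,
}
scores = {name: accuracy(model, testing_data) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "-> best:", best_name)
```

Each candidate is scored only on the 25% it has never seen, which is what makes the accuracy figures a fair estimate of how the model will behave on new data.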
Deploy and Make Predictions
Even though a good model has been found for a particular problem, it is useless until it is deployed into production use and actually solving the problem in a real-life scenario. When data scientists develop and test models, they are just in a laboratory where the model is of little practical use.
To do any practical work, the prediction model must be deployed as software, running on a computer with real data for real use. Traditionally, it has been a manual process to implement the selected model into a software function of practical use.
In effect, this means the prediction model must be implemented in Python, Java, C#, C/C++, or, at least, be accessible from such popular programming languages through an application programming interface (API). Software libraries like Scikit-learn or Keras/TensorFlow are often used to simplify the process.
Lately, ready-made cloud-based hosting services have become available, dramatically simplifying the development, testing, and deployment of machine learning algorithms. Google, Microsoft, and Amazon all provide predictive analytics and machine learning technology as a hosted cloud platform, for example.
You can develop, test, and deploy a machine learning system using such cloud services with little more than a web browser and a credit card. The final prediction models are implemented more or less automatically on their servers and can be accessed by any software application (including smartphone apps) over the Internet using a web services (REST) API.
Software applications call the machine learning model hosted in the cloud platform using a set of new data. The model helps solve the problem at hand by returning its prediction for the new data set.
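From the application's point of view, such a call is an ordinary HTTP request. The sketch below builds one with Python's standard library. The endpoint URL, the API key, and the JSON layout are all hypothetical; each cloud service defines its own URLs, authentication scheme, and payload format, so check the provider's documentation for the real ones.

```python
import json
import urllib.request

# Hypothetical endpoint and key, used only for illustration.
ENDPOINT = "https://ml.example.com/v1/models/fraud-detector/predict"
API_KEY = "replace-with-your-key"

def build_prediction_request(features):
    """Package the new data as a JSON POST request (payload layout is an assumption)."""
    body = json.dumps({"instances": [features]}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

def predict(features, timeout=10):
    """Send the request to the hosted model and return its decoded JSON answer."""
    request = build_prediction_request(features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Building the request needs no network access; predict() would actually send it.
request = build_prediction_request({"amount": 1500.0, "distance_km": 3000.0})
print(request.get_method(), request.full_url)
```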
With such cloud-based solutions now becoming available at a relatively low cost and skill threshold, predictive machine learning systems can be built much more easily than before, and are now feasible for small companies and individual developers of smartphone apps.
This dramatically reduces the barrier to entry and will drive the adoption of predictive analytics and machine learning to a scale never before seen.
Changing Conditions Require Re-Training
I have so far explained how a machine learning system could be trained, tested, and deployed for production use. But didn’t I over-promise? Nowhere have I described a system that actually learns on its own by adapting to changing conditions itself.
Admittedly, the prediction model is automatically derived from the historical data, but this happens with the help of a data scientist who performs experiments and selects the final model to use, which can then be implemented as a software solution. So where is the part where the machine learns by itself?
The trick is to feed the selected machine learning model with more training data over time that continues to teach it as more data become available. This is particularly important with data patterns that change, for example, seasonal temperature variations or new products that can be recommended to buyers in a web shop.
The process of feeding additional training data into a model later on is called re-training the model. This can be done any number of times, depending on when and how often new data sets become available, or how fast the data patterns being detected and predicted change.
Re-training can be laborious, since time must repeatedly be spent importing and preprocessing additional training data for the model. Wouldn’t it be better if the prediction model could continuously re-train itself?
This is possible if software can insert new training data into the prediction model automatically, thus creating a machine that learns continuously. This is truly a learning machine that adapts its behavior automatically as it learns more from additional data.
One example is the tunnel traffic flow prediction. Sensors read the traffic volume in many parts of the city, including the tunnel, each minute. These traffic measurements are continuously fed into the model to teach it how to predict tunnel traffic to an even higher accuracy over time.
Automatic and frequent re-training using new training data is particularly important in an environment where the detected and predicted patterns change over time, as the model needs to modify its behavior to adapt to its changing environment. The tunnel may have seasonal variations of the traffic flow, for example.
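A very small sketch of such continuous learning is shown below. The model keeps one traffic estimate per hour of the day and nudges it toward every new sensor reading, so that older patterns gradually fade. The class name, the learning rate, and the readings are assumptions made for this example, not how any particular cloud service implements re-training.

```python
class OnlineTrafficModel:
    """Predicts vehicles per minute for an hour of the day and learns from each new reading."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.estimates = {}  # hour of day -> current estimate

    def predict(self, hour):
        """Return the current estimate, or None if that hour has never been observed."""
        return self.estimates.get(hour)

    def learn(self, hour, observed):
        """Update the estimate for `hour` with one new measurement."""
        current = self.estimates.get(hour)
        if current is None:
            self.estimates[hour] = float(observed)
        else:
            # Move the estimate a small step toward the new observation.
            self.estimates[hour] = current + self.learning_rate * (observed - current)

model = OnlineTrafficModel()
for reading in (100, 120):  # hypothetical readings at 8 a.m. on two days
    model.learn(hour=8, observed=reading)
print(model.predict(8))
```

A higher learning rate makes the model adapt faster to changes such as seasonal variation, at the cost of reacting more strongly to random fluctuations.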
To this end, Microsoft added a re-training API to its Azure ML cloud services back in 2015, enabling a software program to continuously feed re-training data into the deployed prediction model. This allows the model to improve its prediction accuracy by learning from new data all the time. The Amazon Web Services platform also provides an API for software control of its cloud services.
Only a few years ago, most people would have considered this to be science fiction. Today, any marginally skilled software developer can create systems that contain predictive analytics and continuous machine learning. With the rise of cloud hosted predictive analytics and machine learning services, these technologies can be used by almost any company, even small ones, and can be found in virtually every industry.