July 16, 2021

What is Overfitting and How Can It Be Corrected?

Photo by Ylanite Koppens on Pexels.

Overfitting is the scourge of machine learning algorithms and the most typical pitfall for newcomers. It cannot be overstated: do not propose a machine learning algorithm to your boss before you understand what overfitting is and how to cope with it. It will almost certainly mean the difference between a spectacular triumph and a disastrous failure.
Fortunately, overfitting is also a fascinating topic, and many of the remedies are built into the very structure of the algorithms you’re already using. Let’s look at what overfitting is and how we can combat it in the real world.

Your model is a little too squiggly

Overfitting is a fairly simple issue that appears paradoxical at first glance. Simply put, overfitting is what happens when your model fits the training data too well: so well that it learns the noise in that data along with the underlying pattern.

At first look, this may appear strange. The goal of machine learning is to find the best match for the data. What makes you think your model is too good at it?

The issue is in how we frame the goal of “fitting the data.” In machine learning, there are two key metrics to keep an eye on at all times: the training error and the test error. Training error measures how far off your model is on the data it was trained on, while test error measures how far off it is on data it has never seen, which is our best stand-in for how it will do in the wild.
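To make the two metrics concrete, here is a minimal sketch in plain Python. The function names (`holdout_split`, `mean_squared_error`), the toy data, and the hand-written `predict` function are all made up for illustration; mean squared error is just one common choice of error measure.

```python
import random


def holdout_split(points, test_fraction=0.25, seed=0):
    """Shuffle (x, y) pairs and split them into training and test sets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(predict, points):
    """Average squared gap between predictions and observed y values."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


# Toy data (an assumption for this sketch): y is roughly 2x plus noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
data = [(x, 2 * x + e) for x, e in zip(range(8), noise)]

train, test = holdout_split(data)


def predict(x):
    # Hypothetical "trained" model, written by hand for illustration.
    return 2 * x


train_error = mean_squared_error(predict, train)
test_error = mean_squared_error(predict, test)
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```

The important design point is that the test points are set aside before any fitting happens, so the test error is measured on data the model has never seen.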

Our goal while developing an algorithm is to produce a model that works well in the real world. We don’t mind if our model fails miserably in training as long as it performs well in the real world.

In reality, the only reason we care about training error is that it can provide insight into how the model will perform in testing. If that link is broken, then quantifying our training error is no longer useful.

Because our data will undoubtedly contain some noise, that link will eventually break down once the model starts fitting the noise. The real world rarely follows perfect curves, and even when it does, our measurements of those curves are often inaccurate. Consider rainfall measurement: we can sample the data to get a decent estimate of how much rain fell, but do we truly believe that exactly the measured amount fell everywhere within a multi-mile radius? That would be insane.

As a result, it’s irrational to believe that perfectly fitting your data in training would result in equally good test results. Consider the following example of a dataset.

Now we can draw a curve that perfectly fits the data, and many algorithms are particularly good at finding sophisticated solutions to this problem. We could use a nonlinear regression method to find something similar to this.

Do you see what’s going on? In training, we were able to perfectly fit the data, but why would we expect this to work in the real world?

Let’s say we start by evaluating the model on the green-colored test data.

Your model may appear to cut a good swath through the data, but we need to quantify the error at each test point. In regression, the error is commonly measured as the vertical distance between a point and the fitted curve at the same X value, usually squared and then averaged over all points. What will that entail?

Even though the training error is zero, there’s no denying that the test error is massive. The essence of the issue is that we overfit the training data at the expense of real-world performance.
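The same effect can be reproduced in a few lines of plain Python. This is only a sketch under stated assumptions: the data is made up (a y = x trend with hand-picked noise), and the "squiggly" model is a Lagrange interpolating polynomial, chosen here because it passes through every training point by construction. It is not necessarily the method behind the pictures in this post.

```python
def lagrange_interpolant(points):
    """Return a function for the polynomial through every (x, y) point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            weight = 1.0
            for j, (xj, _) in enumerate(points):
                if i != j:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mean_squared_error(predict, points):
    """Average squared vertical distance between curve and points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


# Made-up data: the underlying trend is y = x, plus measurement noise.
train = [(x, x + (0.5 if x % 2 == 0 else -0.5)) for x in range(8)]
test_noise = [0.1, -0.2, 0.15, -0.1, 0.2, -0.15, 0.05]
test = [(x + 0.5, x + 0.5 + e) for x, e in zip(range(7), test_noise)]

squiggly = lagrange_interpolant(train)
train_error = mean_squared_error(squiggly, train)
test_error = mean_squared_error(squiggly, test)
print(f"training error: {train_error:.6f}")
print(f"test error:     {test_error:.3f}")
```

On this data the training error comes out as zero, while the test error is well above 1 even though the noise in the data never exceeds 0.5. The curve swings wildly between the training points, especially near the edges.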

Getting the Model to Stand Up Straight

Our model was able to overfit the data because we allowed it to find complex solutions to simple problems. That explanation is oversimplified, but the intuition is there. What if we looked for simpler solutions to simple data instead?

Regularization is the term for this notion. Without getting too deep into the arithmetic, the idea is that we alter the error measure to express a preference for simple solutions. In addition to analysing the model’s performance on the data, we measure its complexity and penalise it for coming up with strange solutions like the one we saw above. The new error measure is the sum of the training error and a complexity penalty.
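Written out, the regularized error measure might look like the following. The symbols are assumptions introduced here for illustration: $w$ stands for the model’s parameters, $\Omega$ for whatever complexity measure the method uses, and $\lambda$ for the weight placed on that penalty.

```latex
E_{\text{reg}}(w) = E_{\text{train}}(w) + \lambda \, \Omega(w), \qquad \lambda \ge 0
```

One common choice, used in ridge regression, is $\Omega(w) = \lVert w \rVert_2^2$. Setting $\lambda = 0$ recovers the plain training error.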

We can guide our model towards simpler answers by punishing it for its complexity. By doing so, we raise the model’s bias (error from being too rigid to capture the pattern) slightly while lowering its variance (error from changing too much with the particular training sample) considerably. Because variance is generally the larger source of error in an overfit model, the small price we pay in increased bias results in a net gain in performance.
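The variance half of that trade-off can be seen in a small simulation. This is a sketch under stated assumptions: the y = x trend, the noise level, the query point, and the two models (a polynomial through every point versus a least-squares straight line) are all picked for illustration. Because the underlying trend here really is a straight line, the sketch shows only the difference in variance, not the bias cost.

```python
import random
import statistics


def interpolant_prediction(points, x):
    """Evaluate the polynomial through every point (Lagrange form) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        weight = 1.0
        for j, (xj, _) in enumerate(points):
            if i != j:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def line_prediction(points, x):
    """Evaluate the ordinary least-squares straight line at x."""
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    return mean_y + (sxy / sxx) * (x - mean_x)


rng = random.Random(0)
query_x = 0.5
flexible, simple = [], []
for _ in range(200):
    # Same underlying trend (y = x) each time, with fresh noise.
    sample = [(x, x + rng.gauss(0, 0.5)) for x in range(8)]
    flexible.append(interpolant_prediction(sample, query_x))
    simple.append(line_prediction(sample, query_x))

print(f"variance of flexible model: {statistics.pvariance(flexible):.3f}")
print(f"variance of straight line:  {statistics.pvariance(simple):.3f}")
```

Each pass retrains both models on a fresh noisy sample and records what they predict at the same point. The flexible model’s predictions scatter far more widely from sample to sample, which is exactly what “high variance” means.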

Consider the data we wanted to train on in the first place. What if we train our model to prefer simple solutions, such as a straight line?

Because we aren’t cutting through every point, our training error is obviously bigger, but the complexity penalty is much, much smaller than it was for the squiggly model. What does it look like when we use our test data?

It’s not bad. What about the red and green error bars?

It’s clear that the average error bar has shrunk dramatically. We constructed a model that performs better in the real world just by adopting a simpler solution. It’s a little strange, but it works.
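Here is a rough sketch of the straight-line case in plain Python. The data is made up (a y = x trend with hand-picked noise), and `fit_line` is an ordinary least-squares fit written out by hand for illustration rather than any particular library’s implementation.

```python
def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(predict, points):
    """Average squared vertical distance between line and points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


# Made-up data: the underlying trend is y = x, plus measurement noise.
train = [(x, x + (0.5 if x % 2 == 0 else -0.5)) for x in range(8)]
test_noise = [0.1, -0.2, 0.15, -0.1, 0.2, -0.15, 0.05]
test = [(x + 0.5, x + 0.5 + e) for x, e in zip(range(7), test_noise)]

line = fit_line(train)
train_error = mean_squared_error(line, train)
test_error = mean_squared_error(line, test)
print(f"training error: {train_error:.3f}")
print(f"test error:     {test_error:.3f}")
```

On this data the line misses every training point, giving a training error of about 0.24, yet its test error is only about 0.03. It ignores the noise and tracks the trend.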

Regularization is handled differently by each machine learning method, but in general, it works by limiting the range of values that model parameters can take. How strongly it limits them is controlled by a hyperparameter, which is a setting that the user chooses rather than one the model learns from the data.
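As one concrete illustration, here is a sketch of a ridge-style penalty on a straight-line fit. The function name `fit_ridge_line`, the `penalty` argument (playing the role of the hyperparameter), and the data are all assumptions made for this example. Other methods regularize in different ways.

```python
def fit_ridge_line(points, penalty):
    """Fit y = intercept + slope * x with a penalty on the slope.

    Minimises sum((y - intercept - slope * x) ** 2) + penalty * slope ** 2.
    The intercept is left unpenalised.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / (sxx + penalty)  # a larger penalty shrinks the slope
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration.
points = [(0, 0.5), (1, 0.5), (2, 2.5), (3, 2.5), (4, 4.5), (5, 4.5)]

slopes = {}
for penalty in [0.0, 1.0, 10.0, 100.0]:
    intercept, slope = fit_ridge_line(points, penalty)
    slopes[penalty] = slope
    print(f"penalty={penalty:6.1f}  slope={slope:.3f}")
```

With a penalty of zero this is an ordinary least-squares fit. As the penalty grows, the slope is pulled towards zero, so the user’s choice of hyperparameter directly limits how large the parameter can get.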

