Underfitting & Overfitting


Remember that the main objective of any machine learning model is to generalize from its training data so that it can make accurate predictions on unseen data. As the names suggest, ‘Overfitting’ and ‘Underfitting’ are both the opposite of ‘Generalization’: a model that overfits or underfits does not generalize well and performs poorly on new data.

Underfitting

  • Underfitting occurs when a machine learning model doesn’t fit the training data well enough. It is usually caused by a function that is too simple to capture the underlying trend in the data.
  • Underfitting models have high error on both the training set and the test set. This behavior is called ‘High Bias’.
  • This usually happens when we try to fit a linear function to non-linear data.
  • Since underfitting models don’t perform well even on the training set, underfitting is easy to detect.

    Underfitting.png

How To Avoid Underfitting?

  • Increase the model complexity, e.g. if a linear function underfits, try using polynomial features.
  • Increase the number of features through feature engineering.
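
To make the first point concrete, here is a minimal sketch of a linear function underfitting non-linear data, and of a polynomial feature fixing it. The data is synthetic (y = x^2 plus noise), and the helper name `train_test_mse` is made up for this illustration; it is not taken from the notebooks referenced in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-linear data (assumed for illustration): y = x^2 plus a little noise.
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.5, size=200)

# Simple hold-out split: first 150 points for training, last 50 for testing.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]


def train_test_mse(degree):
    """Fit a polynomial of the given degree on the training set; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


linear_train, linear_test = train_test_mse(1)  # straight line: too simple for this data
quad_train, quad_test = train_test_mse(2)      # adds the x^2 feature

print(f"degree 1: train MSE = {linear_train:.2f}, test MSE = {linear_test:.2f}")
print(f"degree 2: train MSE = {quad_train:.2f}, test MSE = {quad_test:.2f}")
```

The straight line shows high error on both the training and the test set, which is the signature of underfitting. Adding the squared term brings both errors down close to the noise level.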

Example

Please refer to my Multiple Linear Regression Fish Weight Prediction Kaggle notebook. In this study I use a linear function, which does not fit the data well. Although the model score is on the higher side, one major issue with the predictions is negative weight values. This happens for smaller weights (less than 20 g).

Overfitting

  • Overfitting occurs when a machine learning model tries to fit the training data too well. It is usually caused by an overly complicated function that creates lots of unnecessary curves and angles unrelated to the underlying trend, and ends up capturing the noise in the data.
  • Overfitting models have low error on the training set but high error on the test set. This behavior is called ‘High Variance’.

    Overfitting.png
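
The gap between train and test error can be reproduced with a small experiment. The sketch below is only an illustration under assumed conditions: the data is synthetic (a noisy sine curve), the training set is deliberately tiny, and the degrees 3 and 10 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)


def noisy_sine(x):
    """Assumed ground truth for illustration: sin(pi * x) plus Gaussian noise."""
    return np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.shape)


# A small training set makes it easy for a flexible model to memorize the noise.
x_train = np.linspace(-1, 1, 12)
y_train = noisy_sine(x_train)
x_test = rng.uniform(-1, 1, size=200)
y_test = noisy_sine(x_test)

results = {}
for degree in (3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The 10th-degree polynomial has almost as many coefficients as there are training points, so it passes very close to every one of them and its train error is tiny. Its test error is much larger than its train error, because the extra curves follow the noise rather than the sine trend.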

How To Avoid Overfitting?

  • Since an overfitting model captures the noise in the data, reducing the number of features will help. We can manually select only the important features or use a feature selection algorithm to do so.
  • We can also use the ‘Regularization’ technique. It works well when we have lots of slightly useful features. Sklearn linear models (Ridge and LASSO) use the regularization parameter ‘alpha’ to control the size of the coefficients by imposing a penalty. Please refer to the related tutorials for more details.
  • K-fold cross validation. In this technique we divide the training data into K batches (folds); each fold is used once for testing while the remaining folds are used for training.
  • Increasing the amount of training data also helps to avoid overfitting.
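
As a rough sketch of the regularization and K-fold points above, the snippet below compares an unpenalized linear model with Ridge using 5-fold cross validation in Sklearn. The dataset is synthetic, and the value alpha=10 and the 5 folds are arbitrary assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 40 features, only the first 3 matter.
X = rng.normal(size=(60, 40))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=60)

# 5-fold cross validation: each fold serves as the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_test_sizes = [len(test_idx) for _, test_idx in cv.split(X)]

models = {
    "LinearRegression (no penalty)": LinearRegression(),
    "Ridge (alpha=10)": Ridge(alpha=10.0),
}
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()  # sklearn reports negated MSE, so flip the sign
    print(f"{name}: mean CV MSE = {cv_mse[name]:.2f}")

# The penalty shrinks the coefficients compared with the unpenalized fit.
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
ridge_norm = np.linalg.norm(Ridge(alpha=10.0).fit(X, y).coef_)
print(f"coefficient norm: no penalty = {ols_norm:.2f}, ridge = {ridge_norm:.2f}")
```

With many weak features and few samples, the penalized model usually scores a lower cross-validated error, though the exact numbers depend on the data and on alpha. In practice alpha is itself chosen by cross validation.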

Example

Please refer to my Polynomial Linear Regression Fish Wgt Prediction Kaggle notebook. In this study I use a quadratic function; to turn it into an overfitting model, try a 10th-degree function and check the results.

Good Fitting

  • It is the sweet spot between an underfitting and an overfitting model.
  • A good fitting model generalizes what it learns from the training data and provides accurate predictions on new data.
  • To get a good fitting model, keep training and testing the model until you reach the minimum train and test error. The important quantity here is the ‘test error’: a low train error alone can be a sign of overfitting, so always keep an eye on how the test error changes. The sweet spot is just before the test error starts to rise.

    Goodfit.png
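
One way to look for the sweet spot is to increase the model complexity step by step and record both errors. The sketch below does this for polynomial degrees 1 to 8 on synthetic cubic data; the data, the degree range, and the simple hold-out split are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic cubic data (assumed for illustration).
x = rng.uniform(-2, 2, size=150)
y = x**3 - 2 * x + rng.normal(scale=0.3, size=150)

# Hold out the last 50 points to measure the test error.
x_train, x_test = x[:100], x[100:]
y_train, y_test = y[:100], y[100:]

degrees = range(1, 9)
train_errors, test_errors = [], []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train MSE = {train_errors[-1]:.3f}, "
          f"test MSE = {test_errors[-1]:.3f}")

# Pick the complexity with the lowest test error, not the lowest train error.
best_degree = degrees[int(np.argmin(test_errors))]
print("degree with the lowest test error:", best_degree)
```

The train error keeps shrinking as the degree grows, so it cannot be used to pick the model on its own. The test error drops sharply once the model is flexible enough for the data and then flattens or rises, and that turning point is the sweet spot.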

Example

Please refer to my Multiclass Logistic Regression tutorial. In this study I use the linear model from the Sklearn library to perform multiclass logistic regression on a handwritten digits dataset. Notice the algorithm selection and model performance analysis.
