# Underfitting & Overfitting

Remember that the main objective of any machine learning model is to generalize from the training data so that it can make accurate predictions on unseen data. The terms 'overfitting' and 'underfitting' are, in a sense, opposites of 'generalization': models that overfit or underfit do not generalize well and therefore perform poorly.

## Underfitting

• Underfitting occurs when a machine learning model does not fit the training data well enough. It is usually caused by a function that is too simple to capture the underlying trend in the data.
• Underfitting models have high error on the training set as well as the test set. This behavior is called 'high bias'.
• It usually happens when we try to fit a linear function to non-linear data.
• Since underfitting models perform poorly even on the training set, underfitting is very easy to detect.
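As a minimal sketch of what high bias looks like in code (synthetic quadratic data, not the fish dataset from the notebooks below), a straight line fitted to curved data scores poorly on both the training and the test split:

```python
# Underfitting sketch: a linear model fitted to clearly non-linear data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.3, 200)  # quadratic trend + noise

# Simple split: first 150 samples for training, last 50 for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
# Both R^2 scores are poor: the straight line cannot follow the curve,
# so the error is high on training AND test data (high bias).
```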

### How To Avoid Underfitting?

• Increase the model complexity, e.g. if a linear function underfits, try adding polynomial features.
• Increase the number of features through feature engineering.
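A minimal sketch of the first remedy, using sklearn's `PolynomialFeatures` on the same kind of synthetic quadratic data (an assumption for illustration, not the notebook's dataset):

```python
# Fixing underfitting by adding polynomial features to a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.3, 200)  # quadratic trend + noise

linear = LinearRegression().fit(X, y)               # underfits the curve
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)  # captures the curve

linear_r2 = linear.score(X, y)  # poor fit
poly_r2 = poly.score(X, y)      # good fit
```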

### Example

Please refer to my Multiple Linear Regression Fish Weight Prediction Kaggle notebook. In that study I use a linear function, which does not fit the data well. Although the model score is on the higher side, one major issue is that the model predicts negative weights. This happens for smaller weight values (less than 20 gm).

## Overfitting

• Overfitting occurs when a machine learning model fits the training data too well. It is usually caused by an overly complicated function that creates lots of unnecessary curves and angles unrelated to the underlying trend, and ends up capturing the noise in the data.
• Overfitting models have low error on the training set but high error on the test set. This behavior is called 'high variance'.
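To make the high-variance behavior concrete, here is a small sketch (synthetic data again): a 10th-degree polynomial fitted to a handful of noisy points scores very well on the training set but worse on fresh points from the same distribution.

```python
# Overfitting sketch: a 10th-degree polynomial memorizing a small noisy sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, 15).reshape(-1, 1)
y_train = X_train.ravel() ** 2 + rng.normal(0, 1.0, 15)

# Held-out points from the same range, with fresh noise
X_test = rng.uniform(-3, 3, 100).reshape(-1, 1)
y_test = X_test.ravel() ** 2 + rng.normal(0, 1.0, 100)

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # very high: low training error
test_r2 = model.score(X_test, y_test)     # lower: high variance
```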

### How To Avoid Overfitting?

• Since an overfitting model captures the noise in the data, reducing the number of features helps. We can manually select only the important features, or use a model selection algorithm to do it for us.
• We can also use the 'regularization' technique. It works well when we have lots of slightly useful features. Sklearn's linear models Ridge and Lasso use the regularization parameter 'alpha' to control the size of the coefficients by imposing a penalty. Please refer to the tutorials below for more details.
• K-fold cross validation. In this technique we divide the training data into K folds; each fold is held out once for testing while the model is trained on the remaining folds, which gives a more reliable estimate of the test error.
• Increasing the amount of training data also helps to avoid overfitting.
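As a sketch of the regularization and cross-validation points together (synthetic data; Ridge's `alpha` and sklearn's `cross_val_score` as described above):

```python
# Taming an overfitting polynomial model with Ridge regularization,
# scored with 5-fold cross validation.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1.0, 30)

degree = 10  # deliberately too complex for 30 samples
plain = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                      LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                      Ridge(alpha=1.0))  # alpha penalizes large coefficients

# Each of the 5 folds is held out once for testing; report mean R^2
plain_cv = cross_val_score(plain, X, y, cv=5).mean()
ridge_cv = cross_val_score(ridge, X, y, cv=5).mean()
# The penalized model typically generalizes noticeably better here.
```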

### Example

Please refer to my Polynomial Linear Regression Fish Weight Prediction Kaggle notebook. In that study I use a quadratic function; to turn it into an overfitting model, try a 10th-degree function and check the results.

## Good Fitting

• A good fit is the sweet spot between an underfitting and an overfitting model.
• A good fitting model generalizes the learnings from the training data and provides accurate predictions on new data.
• To get a good fitting model, keep training and testing while monitoring both the training and the test error. The key metric is the test error: a low training error alone may just mean overfitting, so always keep an eye on the test error. The sweet spot is just before the test error starts to rise.
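A minimal sketch of that procedure on synthetic data: sweep the model complexity (polynomial degree here), track the test error at each step, and pick the complexity where the test error bottoms out.

```python
# Finding the sweet spot: test error vs. model complexity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1.0, 80)  # true complexity: degree 2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

test_mse = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    test_mse[degree] = mean_squared_error(y_te, model.predict(X_te))

best_degree = min(test_mse, key=test_mse.get)
# Training error keeps falling as degree grows, but test error bottoms
# out near the true complexity and then rises again, so we stop there.
```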

### Example

Please refer to my Multiclass Logistic Regression notebook. In that study I use a linear model from the Sklearn library to perform multiclass logistic regression on the handwritten digits dataset. Notice the algorithm selection and the model performance analysis.
