K Fold Cross Validation

KFold_Cross_Validation_Header_1000x690

Index

Introduction

As of now we have divided the input data into train and test datasets and use it for model training and testing respectively. This method is not very reliable as train and test data not always have same kind of variation like original data, which will affect the accuracy of the model. Cross validation solves this problem by dividing the input data into multiple groups instead of just two groups. There are multiple ways to split the data, in this article we are going to cover K Fold and Stratified K Fold cross validation techniques.

In case you are not familiar with train test split method, please refer this article.

Inner Working of Cross Validation

  • Shuffle the dataset in order to remove any kind of order
  • Split the data into K number of folds. K= 5 or 10 will work for most of the cases.
  • Now keep one fold for testing and remaining all the folds for training.
  • Train(fit) the model on train set and test(evaluate) it on test set and note down the results for that split
  • Now repeat this process for all the folds, every time choosing separate fold as test data
  • So for every iteration our model gets trained and tested on different sets of data
  • At the end sum up the scores from each split and get the mean score

Inner_Working_KFold

K Fold Cross Validation

In case of K Fold cross validation input data is divided into ‘K’ number of folds, hence the name K Fold. Suppose we have divided data into 5 folds i.e. K=5. Now we have 5 sets of data to train and test our model. So the model will get trained and tested 5 times, but for every iteration we will use one fold as test data and rest all as training data. Note that for every iteration, data in training and test fold changes which adds to the effectiveness of this method.

This significantly reduces underfitting as we are using most of the data for training(fitting), and also significantly reduces overfitting as most of the data is also being used in validation set. K Fold cross validation helps to generalize the machine learning model, which results in better predictions on unknown data. To know more about underfitting & overfitting please refer this article.

For most of the cases 5 or 10 folds are sufficient but depending on problem you can split the data into any number of folds.

KFold_Cross_Validation

Stratified K Fold Cross Validation

Stratified K Fold used when just random shuffling and splitting the data is not sufficient, and we want to have correct distribution of data in each fold. In case of regression problem folds are selected so that the mean response value is approximately equal in all the folds. In case of classification problem folds are selected to have same proportion of class labels. Stratified K Fold is more useful in case of classification problems, where it is very important to have same percentage of labels in every fold.

Stratified_KFold_Cross_Validation

Hyperparameter Tuning and Model Selection

Now you are familiar with inner working of cross validation, lets see how we can use it to tune the parameters and select best model.

For hyperparameter tuning or to find the best model we have to run the model against multiple combination of parameters and features and record score for analysis. To do this we can use sklearns ‘cross_val_score’ function. This function evaluates a score by cross-validation, and depending on the scores we can finalize the hyperparameter which provides the best results. Similarly, we can try multiple model and choose the model which provides the best score.

Note: In this article I will do the model parameter tuning using for loop for better understanding. There are more sophisticated ways like using GridSearchCV() to do hyperparameter tuning. To know more about it please refer this article

Advantages

  • We end up using all the data for training and testing and this is very useful in case of small datasets
  • It covers the variation of input data by validating the performance of the model on multiple folds
  • Multiple folds also helps in case of unbalanced data
  • Model performance analysis for every fold gives us more insights to fine tune the model
  • Used for hyperparameter tuning

Disadvantages

K Fold cross validation not really helpful in case time series data. To know more about time series data please refer this tutorial

K Fold: Regression Example

We are going to use House Prices: Advanced Regression Techniques competition data. We will convert this dataset into toy dataset so that we can straightaway jump into model building using K Fold cross validation

House_Prices_Advanced_Regression_Techniques

Import Libraries

  • pandas: Used for data manipulation and analysis
  • numpy : Numpy is the core library for scientific computing in Python. It is used for working with arrays and matrices.
  • KFold: Sklearn K-Folds cross-validator
  • StratifiedKFold: Stratified K-Folds cross-validator
  • cross_val_score: Sklearn library to evaluate a score by cross-validation
  • linear_model: Sklearn library, we are using LinearRegression and LogisticRegression algorithm
  • tree: Sklearn library, we are using DecisionTreeRegressor and DecisionTreeClassifier
  • ensemble: SKlearn library, we are using RandomForestRegressor and RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn import linear_model, tree, ensemble

Load Dataset

We will load the dataset into pandas dataframe and convert it into a toy dataset by removing categorical columns and rows and columns with null values.

train_data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

# Remove rows with missing target values
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice # Target variable             
train_data.drop(['SalePrice'], axis=1, inplace=True) # Removing target variable from training data

train_data.drop(['LotFrontage', 'GarageYrBlt', 'MasVnrArea'], axis=1, inplace=True) # Remove columns with null values

# Select numeric columns only
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()

print("Shape of input data: {} and shape of target variable: {}".format(X.shape, y.shape))

X.head() # Show first 5 training examples
Shape of input data: (1460, 34) and shape of target variable: (1460,)
Id MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 1 60 8450 7 5 2003 2003 706 0 150 ... 548 0 61 0 0 0 0 0 2 2008
1 2 20 9600 6 8 1976 1976 978 0 284 ... 460 298 0 0 0 0 0 0 5 2007
2 3 60 11250 7 5 2001 2002 486 0 434 ... 608 0 42 0 0 0 0 0 9 2008
3 4 70 9550 7 5 1915 1970 216 0 540 ... 642 0 35 272 0 0 0 0 2 2006
4 5 60 14260 8 5 2000 2000 655 0 490 ... 836 192 84 0 0 0 0 0 12 2008

5 rows × 34 columns

Understanding the Data

Final dataset contains 34 features and 1460 training examples. We have to predict the house sales price based on available training data.

Model Score Using KFold

Let’s use cross_val_score() to evaluate a score by cross-validation. We are going to use three different models for analysis. We will find the score for every split and then take average to get the overall score. We will analyze the model performance based on Root Mean Square Error (RMSE). Since RMSE is not directly available from scoring parameter, first we find the Mean Square Error and then take the square root of it.

# Lets split the data into 5 folds.  
# We will use this 'kf'(KFold splitting stratergy) object as input to cross_val_score() method
kf =KFold(n_splits=5, shuffle=True, random_state=42)

cnt = 1
# split()  method generate indices to split data into training and test set.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
Fold:1, Train set: 1168, Test set:292
Fold:2, Train set: 1168, Test set:292
Fold:3, Train set: 1168, Test set:292
Fold:4, Train set: 1168, Test set:292
Fold:5, Train set: 1168, Test set:292
"""
Why we are using '-' sign to calculate RMSE?
ANS: Classification accuracy is reward function, means something you want to maximize. Mean Square Error is loss function, 
means something you want to minimize. Now if we use 'cross_val_score' function then best score(high value) will give worst 
model in case of loss function! There are other sklearn functions which also depends on 'cross_val_score' to select best model by
looking for highest scores, so a design decision was made for 'cross_val_score' to negate the output of all loss function. 
So that when other sklearn function calls 'cross_val_score' those function can always assume that highest score indicate better model.
In short ignore the negative sign and rate the error based on its absolute value.
"""
def rmse(score):
    rmse = np.sqrt(-score)
    print(f'rmse= {"{:.2f}".format(rmse)}')

Using Linear Regression

score = cross_val_score(linear_model.LinearRegression(), X, y, cv= kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold: {score}')
rmse(score.mean())
Scores for each fold: [-1.39334669e+09 -1.32533433e+09 -3.39493937e+09 -9.31045536e+08
 -7.16620849e+08]
rmse= 39398.70

Using Decision Tree Regressor

score = cross_val_score(tree.DecisionTreeRegressor(random_state= 42), X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold: {score}')
rmse(score.mean())
Scores for each fold: [-2.28396934e+09 -1.70193863e+09 -2.50505513e+09 -1.48547479e+09
 -1.66691378e+09]
rmse= 43916.63

Using Random Forest Regressor

score = cross_val_score(ensemble.RandomForestRegressor(random_state= 42), X, y, cv= kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold are: {score}')
rmse(score.mean())
Scores for each fold are: [-8.58316418e+08 -6.13821216e+08 -2.06121160e+09 -7.97273029e+08
 -5.68429309e+08]
rmse= 31301.92

Model Tuning using KFold

We can also use cross_val_score() along with KFold to evaluate the model for different hyperparameters. Here we are going to try different hyperparameter values and choose the ones for which we get the highest model score.

Decision Tree Regressor Tuning

There are multiple hyperparameters like max_depth, min_samples_split, min_samples_leaf etc which affect the model performance. Here we are going to do tuning based on ‘max_depth’. We will try with max depth starting from 1 to 10 and depending on the final ‘rmse’ score choose the value of max_depth.

max_depth = [1,2,3,4,5,6,7,8,9,10]

for val in max_depth:
    score = cross_val_score(tree.DecisionTreeRegressor(max_depth= val, random_state= 42), X, y, cv= kf, scoring="neg_mean_squared_error")
    print(f'For max depth: {val}')
    rmse(score.mean())
For max depth: 1
rmse= 58803.64
For max depth: 2
rmse= 50060.31
For max depth: 3
rmse= 42152.85
For max depth: 4
rmse= 39218.54
For max depth: 5
rmse= 40185.90
For max depth: 6
rmse= 40522.15
For max depth: 7
rmse= 41089.08
For max depth: 8
rmse= 41161.27
For max depth: 9
rmse= 41441.94
For max depth: 10
rmse= 41758.39

Random Forest Regressor Tuning

There are multiple hyperparameters like n_estimators, max_depth, min_samples_split etc which affect the model performance. Here we are going to do tuning based on ‘n_estimators’. We will try with estimators starting from 50 to 350 and depending on the final ‘rmse’ score choose the value of estimator.

estimators = [50, 100, 150, 200, 250, 300, 350]

for count in estimators:
    score = cross_val_score(ensemble.RandomForestRegressor(n_estimators= count, random_state= 42), X, y, cv= kf, scoring="neg_mean_squared_error")
    print(f'For estimators: {count}')
    rmse(score.mean())
For estimators: 50
rmse= 31450.86
For estimators: 100
rmse= 31301.92
For estimators: 150
rmse= 31187.45
For estimators: 200
rmse= 31176.16
For estimators: 250
rmse= 31246.61
For estimators: 300
rmse= 31242.74
For estimators: 350
rmse= 31313.74

K Fold: Classification Example

We are going to use Titanic: Machine Learning from Disaster competition data. We will convert this dataset into toy dataset so that we can straightaway jump into model building using K Fold cross validation

Titanic_Machine_Learning_from_Disaster

Load Dataset

We will load the dataset into pandas dataframe and convert it into a toy dataset by removing categorical columns and rows and columns with null values.

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Remove rows with missing target values
train_data.dropna(axis=0, subset=['Survived'], inplace=True)
y = train_data.Survived # Target variable             
train_data.drop(['Survived'], axis=1, inplace=True) # Removing target variable from training data

train_data.drop(['Age'], axis=1, inplace=True) # Remove columns with null values

# Select numeric columns only
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()

print("Shape of input data: {} and shape of target variable: {}".format(X.shape, y.shape))
pd.concat([X, y], axis=1).head() # Show first 5 training examples
Shape of input data: (891, 5) and shape of target variable: (891,)
PassengerId Pclass SibSp Parch Fare Survived
0 1 3 1 0 7.2500 0
1 2 1 1 0 71.2833 1
2 3 3 0 0 7.9250 1
3 4 1 1 0 53.1000 1
4 5 3 0 0 8.0500 0

Understanding the Data

Final dataset contains 5 features and 891 training examples. We have to predict which passengers survived the Titanic shipwreck based on available training data. Features that we are going to use in this example are passenger id, ticket class, sibling/spouse aboard, parent/children aboard and ticket fare

Model Score Using KFold

Let’s use cross_val_score() to evaluate a score by cross-validation. We are going to use three different models for analysis. We are going to find the score for every fold and then take average to get the overall score. We will analyze the model performance based on accuracy score, here score value indicate how many predictions are matching with actual values.

# Lets split the data into 5 folds. 
# We will use this 'kf'(StratiFiedKFold splitting stratergy) object as input to cross_val_score() method
# The folds are made by preserving the percentage of samples for each class.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cnt = 1
# split()  method generate indices to split data into training and test set.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt+=1
    
# Note that: 
# cross_val_score() parameter 'cv' will by default use StratifiedKFold spliting startergy if we just specify value of number of folds. 
# So you can bypass above step and just specify cv= 5 in cross_val_score() function
Fold:1, Train set: 712, Test set:179
Fold:2, Train set: 713, Test set:178
Fold:3, Train set: 713, Test set:178
Fold:4, Train set: 713, Test set:178
Fold:5, Train set: 713, Test set:178

Using Logistic Regression

score = cross_val_score(linear_model.LogisticRegression(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.66480447 0.69662921 0.70224719 0.69101124 0.66292135]
Average score: 0.68

Using Decision Classifier

score = cross_val_score(tree.DecisionTreeClassifier(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.67039106 0.61235955 0.5505618  0.64044944 0.69101124]
Average score: 0.63

Using Random Forest Classifier

score = cross_val_score(ensemble.RandomForestClassifier(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.74301676 0.66292135 0.65730337 0.70786517 0.73033708]
Average score: 0.70

Model Tuning using KFold

We can also use cross_val_score() along with StratifiedKFold to evaluate the model for different hyperparameters. Here we are going to try different hyperparameter values and choose the ones for which we get the highest model score.

Logistic Classifier Tuning

We will try different optimization algorithm to finalize the one with the highest accuracy.

algorithms = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

for algo in algorithms:
    score = cross_val_score(linear_model.LogisticRegression(max_iter= 4000, solver= algo, random_state= 42), X, y, cv= kf, scoring="accuracy")
    print(f'Average score({algo}): {"{:.3f}".format(score.mean())}')
    
# Note, here we are using max_iter = 4000, so that all the solver gets chance to converge. 
Average score(newton-cg): 0.684
Average score(lbfgs): 0.684
Average score(liblinear): 0.684
Average score(sag): 0.678
Average score(saga): 0.681

Decision Tree Classifier Tuning

Here we are going to do tuning based on ‘max_depth’. We will try with max depth starting from 1 to 10 and depending on the final ‘accuracy’ score choose the value of max_depth.

max_depth = [1,2,3,4,5,6,7,8,9,10]

for val in max_depth:
    score = cross_val_score(tree.DecisionTreeClassifier(max_depth= val, random_state= 42), X, y, cv= kf, scoring="accuracy")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')
Average score(1): 0.668
Average score(2): 0.706
Average score(3): 0.713
Average score(4): 0.687
Average score(5): 0.688
Average score(6): 0.682
Average score(7): 0.669
Average score(8): 0.669
Average score(9): 0.663
Average score(10): 0.664

Random Forest Classifier Tuning

Here we are going to do tuning based on ‘n_estimators’. We will try with estimators starting from 50 to 350 and depending on the final ‘rmse’ score, choose the value of estimator.

n_estimators = [50, 100, 150, 200, 250, 300, 350]

for val in n_estimators:
    score = cross_val_score(ensemble.RandomForestClassifier(n_estimators= val, random_state= 42), X, y, cv= kf, scoring="accuracy")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

Reference

2020

ANN Model to Classify Images

12 minute read

In this guide we are going to create and train the neural network model to classify the clothing images. We will use TensorFlow deep learning framework along...

Introduction to NLP

8 minute read

In short NLP is an AI technique used to do text analysis. Whenever we have lots of text data to analyze we can use NLP. Apart from text analysis, NLP also us...

K Fold Cross Validation

13 minute read

There are multiple ways to split the data for model training and testing, in this article we are going to cover K Fold and Stratified K Fold cross validation...

K-Means Clustering

12 minute read

K-Means clustering is most commonly used unsupervised learning algorithm to find groups in unlabeled data. Here K represents the number of groups or clusters...

Time Series Analysis and Forecasting

10 minute read

Any data recorded with some fixed interval of time is called as time series data. This fixed interval can be hourly, daily, monthly or yearly. Objective of t...

Support Vector Machines

9 minute read

Support vector machines is one of the most powerful ‘Black Box’ machine learning algorithm. It belongs to the family of supervised learning algorithm. Used t...

Random Forest

11 minute read

Random forest is supervised learning algorithm and can be used to solve classification and regression problems. Unlike decision tree random forest fits multi...

Decision Tree

14 minute read

Decision tree explained using classification and regression example. The objective of decision tree is to split the data in such a way that at the end we hav...

Agile Scrum Framework

7 minute read

This tutorial covers basic Agile principles and use of Scrum framework in software development projects.

Underfitting & Overfitting

2 minute read

Main objective of any machine learning model is to generalize the learning based on training data, so that it will be able to do predictions accurately on un...

Binary Logistic Regression Using Sklearn

5 minute read

In this tutorial we are going to use the Logistic Model from Sklearn library. We are also going to use the same test data used in Logistic Regression From Sc...

Train Test Split

3 minute read

In this tutorial we are going to study about train, test data split. We will use sklearn library to do the data split.

One Hot Encoding

11 minute read

In this tutorial we are going to study about One Hot Encoding. We will also use pandas and sklearn libraries to convert categorical data into numeric data.

Back to top ↑