Random Forest


Random forest is a supervised learning algorithm that can be used to solve both classification and regression problems. Since a decision tree fits only one tree to the dataset, it may overfit and the model may not generalize well. Unlike a decision tree, random forest fits multiple decision trees on various sub-samples of the dataset and makes predictions by averaging the predictions from each tree. Averaging the results from multiple decision trees helps control overfitting and yields much better prediction accuracy. As you may have noticed, this algorithm uses multiple trees, hence the name ‘Random Forest’.

This tutorial is part of my ‘Beginner Series Tutorials’. I would recommend going through the Decision Tree tutorial first for a better understanding.

Note: The source code used in this article is available in this Kaggle Kernel

Inner Workings Of Random Forest

  • Select a few random sub-samples from the given dataset
  • Construct a decision tree for every sub-sample and predict the result (to learn more about decision tree formation, please refer to Inner Workings Of Decision Tree)
  • Perform a vote on the predictions from all the trees
  • Finally, select the most voted result as the final prediction (these four steps are sketched in code below)

For more details about how the random forest classifier splits the data, please refer to Criteria To Split The Data
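Here is a minimal sketch of those four steps, built on scikit-learn’s DecisionTreeClassifier. The function simple_random_forest_predict and its parameters are illustrative, not the library’s implementation; it assumes X and y are NumPy arrays. Note that sklearn’s real RandomForestClassifier averages class probabilities across trees rather than taking a hard vote.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest_predict(X, y, X_new, n_trees=3, seed=1):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Steps 1 & 2: draw a random sub-sample (with replacement) and fit a tree on it
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    # Steps 3 & 4: collect one prediction per tree and return the most voted result
    votes = [tree.predict(X_new)[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]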

[Figure: a random forest of three decision trees, each trained on a different sub-sample of the circle/square/triangle dataset]

How Does Random Forest Handle Missing Data?

  • Refer to the diagram above, where we have a training dataset of circles, squares, and triangles, colored red, green, and blue respectively.
  • There are 27 training examples in total. Random forest will create three sub-samples of 9 training examples each
  • The random forest algorithm will create a different decision tree for each sub-sample
  • Notice that each tree uses different criteria to split the data
  • Now it is a straightforward analysis for the algorithm to predict the shape of a given figure from its known attributes. Let’s check the predictions of each tree for a blue triangle:
    • Tree 1 will predict: triangle
    • Tree 2 will predict: square
    • Tree 3 will predict: triangle
    • Since the majority of votes is for triangle, the final prediction is ‘triangle’
  • Now, let’s check the predictions for a circle with no color defined (the color attribute is missing here):
    • Tree 1 will predict: triangle
    • Tree 2 will predict: circle
    • Tree 3 will predict: circle
    • Since the majority of votes is for circle, the final prediction is ‘circle’
  • Please note this is an oversimplified example, but it gives you an idea of how multiple trees with different split criteria help handle missing features; the snippet below expresses the vote directly
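In code, the vote from this toy example is just a majority count. The three hard-coded predictions below simply stand in for the trees in the diagram:

from collections import Counter

tree_predictions = ['triangle', 'square', 'triangle']  # Trees 1, 2 and 3
print(Counter(tree_predictions).most_common(1)[0][0])  # 'triangle' wins the vote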

Advantages Of Random Forest

  • Reduces model overfitting by averaging the results from multiple decision trees
  • High level of accuracy
  • Works well even when some data is missing
  • Repeated model training is not required

Disadvantages Of Random Forest

  • Random forest generates complex models that are difficult to understand and interpret
  • More time and computational resources are required compared to a decision tree
  • Predictions are slower than those of a decision tree

Classification Problem Example

For the classification exercise we are going to use sklearn’s wine recognition dataset. The objective is to classify wines into three categories based on the available data. The data is the result of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.

Understanding Dataset

  • wine.DESCR > Complete description of the dataset
  • wine.data > Data to learn from. Each training example is an array of 13 features.
    • 178 training examples in total.
    • Samples per class: [59, 71, 48]
  • wine.feature_names > Array of all 13 feature names, listed below
    • Alcohol
    • Malic acid
    • Ash
    • Alcalinity of ash
    • Magnesium
    • Total phenols
    • Flavanoids
    • Nonflavanoid phenols
    • Proanthocyanins
    • Color intensity
    • Hue
    • OD280/OD315 of diluted wines
    • Proline
  • wine.target > The classification label. For every training example there is one classification label (0, 1, or 2): 0 for class_0, 1 for class_1, and 2 for class_2
  • wine.filename > CSV file name
  • wine.target_names > Name of the classes. It’s an array [‘class_0’, ‘class_1’, ‘class_2’]
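These figures are easy to verify once the dataset is loaded (the full loading code follows in the next sections); a quick sanity check:

from sklearn.datasets import load_wine
import numpy as np

wine = load_wine()
print(wine.data.shape)           # (178, 13): 178 examples, 13 features
print(np.bincount(wine.target))  # [59 71 48]: samples per class
print(wine.target_names)         # ['class_0' 'class_1' 'class_2']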

Import The Libraries

  • pandas: Used for data manipulation and analysis
  • numpy : Numpy is the core library for scientific computing in Python. It is used for working with arrays and matrices.
  • datasets: Here we are going to use ‘wine’ and ‘boston house prices’ datasets
  • model_selection: Here we are going to use model_selection.train_test_split() for splitting the data
  • ensemble: Here we are going to use random forest classifier and regressor
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import ensemble

Load The Dataset

wine = datasets.load_wine()
print('Dataset structure= ', dir(wine))

df = pd.DataFrame(wine.data, columns = wine.feature_names)
df['target'] = wine.target
df['wine_class'] = df.target.apply(lambda x : wine.target_names[x]) # Each value from 'target' is used as index to get corresponding value from 'target_names' 

print('Unique target values=',df['target'].unique())

df.head()
Dataset structure=  ['DESCR', 'data', 'feature_names', 'target', 'target_names']
Unique target values= [0 1 2]
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target wine_class
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0 class_0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0 class_0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0 class_0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0 class_0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0 class_0

Let’s visualize the feature values for each type of wine

# label = 0 (wine class_0)
df[df.target == 0].head(3)
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target wine_class
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0 class_0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0 class_0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0 class_0
# label = 1 (wine class_1)
df[df.target == 1].head(3)
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target wine_class
59 12.37 0.94 1.36 10.6 88.0 1.98 0.57 0.28 0.42 1.95 1.05 1.82 520.0 1 class_1
60 12.33 1.10 2.28 16.0 101.0 2.05 1.09 0.63 0.41 3.27 1.25 1.67 680.0 1 class_1
61 12.64 1.36 2.02 16.8 100.0 2.02 1.41 0.53 0.62 5.75 0.98 1.59 450.0 1 class_1
# label = 2 (wine class_2)
df[df.target == 2].head(3)
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target wine_class
130 12.86 1.35 2.32 18.0 122.0 1.51 1.25 0.21 0.94 4.1 0.76 1.29 630.0 2 class_2
131 12.88 2.99 2.40 20.0 104.0 1.30 1.22 0.24 0.83 5.4 0.74 1.42 530.0 2 class_2
132 12.81 2.31 2.40 24.0 98.0 1.15 1.09 0.27 0.83 5.7 0.66 1.36 560.0 2 class_2

Build Machine Learning Model

# Let's create the feature matrix X and label vector y
X = df[['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium','total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']]
y = df[['target']]

print('X shape=', X.shape)
print('y shape=', y.shape)
X shape= (178, 13)
y shape= (178, 1)

Create Test And Train Dataset

  • We will split the dataset so that we can use one set of data for training the model and another set for testing the model
  • We will keep 20% of the data for testing and 80% for training the model. If you want to learn more about this, please refer to the Train Test Split tutorial
X_train,X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size= 0.2, random_state= 1)
print('X_train dimension= ', X_train.shape)
print('X_test dimension= ', X_test.shape)
print('y_train dimension= ', y_train.shape)
print('y_test dimension= ', y_test.shape)
X_train dimension=  (142, 13)
X_test dimension=  (36, 13)
y_train dimension=  (142, 1)
y_test dimension=  (36, 1)

Now let’s train the model using the Random Forest classification algorithm

"""
To obtain deterministic behaviour during fitting, always set a value for the 'random_state' parameter
Also note that the default criterion used to split the data is 'gini'
"""
rfc = ensemble.RandomForestClassifier(random_state = 1)
rfc.fit(X_train ,y_train.values.ravel()) # Using ravel() to convert column vector y to 1D array 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)
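Because every tree sees only a bootstrap sub-sample, the rows a given tree never saw (its ‘out-of-bag’ samples) provide a built-in validation estimate. As an optional check, sklearn exposes this through the oob_score parameter; a small variation on the fit above:

# Optional: out-of-bag evaluation, computed from the training data alone
rfc_oob = ensemble.RandomForestClassifier(oob_score=True, random_state=1)
rfc_oob.fit(X_train, y_train.values.ravel())
print('Out-of-bag score= ', rfc_oob.oob_score_)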

Testing The Model

  • For testing we are going to use the test data only
  • Question: Predict the wine class of the 10th and 30th samples from the test data
print('Actual Wine type for 10th test data sample= ', wine.target_names[y_test.iloc[10]][0])
print('Wine type prediction for 10th test data sample= ', wine.target_names[rfc.predict(X_test.iloc[[10]])][0]) # iloc[[10]] passes a 2D DataFrame, as predict() expects

print('Actual Wine type for 30th test data sample= ', wine.target_names[y_test.iloc[30]][0])
print('Wine type prediction for 30th test data sample= ', wine.target_names[rfc.predict(X_test.iloc[[30]])][0])
Actual Wine type for 10th test data sample=  class_0
Wine type prediction for 10th test data sample=  class_0
Actual Wine type for 30th test data sample=  class_1
Wine type prediction for 30th test data sample=  class_1

Model Score

Check the model score using test data

rfc.score(X_test, y_test)
0.9722222222222222
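It can also be instructive to see which features the forest relied on. The fitted classifier’s feature_importances_ attribute holds the impurity-based importance of each feature (the values sum to 1); a short sketch using the X defined above:

importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))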

Regression Problem Example

For the regression exercise we are going to use sklearn’s Boston house prices dataset. The objective is to predict the house price based on the available data.

Note: I used the same dataset for the decision tree regressor example, where the model score was 66%. If you are interested, please refer to the decision tree implementation of this problem in the Kaggle Notebook or on my Blog

Understanding the Boston house dataset

  • boston.DESCR > Complete description of dataset
  • boston.data > Data to learn from. There are 13 features; attribute 14 is the target. 506 training examples in total
    • CRIM per capita crime rate by town
    • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS proportion of non-retail business acres per town
    • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    • NOX nitric oxides concentration (parts per 10 million)
    • RM average number of rooms per dwelling
    • AGE proportion of owner-occupied units built prior to 1940
    • DIS weighted distances to five Boston employment centers
    • RAD index of accessibility to radial highways
    • TAX full-value property-tax rate per USD 10,000
    • PTRATIO pupil-teacher ratio by town
    • B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT % lower status of the population
    • MEDV Median value of owner-occupied homes in USD 1000’s
  • boston.feature_names > Array of all 13 features [‘CRIM’ ‘ZN’ ‘INDUS’ ‘CHAS’ ‘NOX’ ‘RM’ ‘AGE’ ‘DIS’ ‘RAD’ ‘TAX’ ‘PTRATIO’ ‘B’ ‘LSTAT’]
  • boston.filename > CSV file name
  • boston.target > The target house price value, in $1000’s

From the above details it’s clear that X = ‘boston.data’ and y = ‘boston.target’

Load The Data

boston = datasets.load_boston()
print('Dataset structure= ', dir(boston))

df = pd.DataFrame(boston.data, columns = boston.feature_names)
df['target'] = boston.target

df.head()
Dataset structure=  ['DESCR', 'data', 'feature_names', 'filename', 'target']
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Build Machine Learning Model

# Let's create the feature matrix X and label vector y
X = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = df[['target']]

print('X shape=', X.shape)
print('y shape=', y.shape)
X shape= (506, 13)
y shape= (506, 1)

Create Test And Train Dataset

  • We will split the dataset so that we can use one set of data for training the model and another set for testing the model
  • We will keep 20% of the data for testing and 80% for training the model. If you want to learn more about this, please refer to the Train Test Split tutorial
X_train,X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size= 0.2, random_state= 1)
print('X_train dimension= ', X_train.shape)
print('X_test dimension= ', X_test.shape)
print('y_train dimension= ', y_train.shape)
print('y_test dimension= ', y_test.shape)
X_train dimension=  (404, 13)
X_test dimension=  (102, 13)
y_train dimension=  (404, 1)
y_test dimension=  (102, 1)

Now let’s train the model using the Random Forest Regressor

"""
To obtain deterministic behaviour during fitting, always set a value for the 'random_state' parameter
Also note that the default criterion used to split the data is 'mse' (mean squared error)
"""
rfr = ensemble.RandomForestRegressor(random_state= 1)
rfr.fit(X_train ,y_train.values.ravel())  # Using ravel() to convert column vector y to 1D array 
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=1, verbose=0, warm_start=False)

Testing The Model

  • For testing we are going to use the test data only
  • Question: Predict the target value for every sample in the test data
prediction = pd.DataFrame(rfr.predict(X_test), columns = ['prediction'])
# After the split, y_test keeps the original shuffled index (starting at 307 here), so we reset it to align with the prediction values
target = y_test.reset_index(drop=True) # dropping the original index column
target_vs_prediction = pd.concat([target,prediction],axis =1)
target_vs_prediction.T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
target 28.200 23.900 16.60 22.00 20.800 23.000 27.900 14.500 21.500 22.600 23.700 31.200 19.300 19.400 19.400 27.900 13.900 50.000 24.100 14.600 16.200 15.600 23.800 25.00 23.50 8.300 13.50 17.500 43.100 11.500 24.100 18.500 50.000 12.60 19.800 24.500 14.900 36.20 11.900 19.100 ... 8.500 14.500 23.700 37.200 41.700 16.500 21.700 22.700 23.000 10.500 21.900 21.000 20.400 21.800 50.000 22.000 23.300 37.300 18.000 19.200 34.900 13.400 22.900 22.500 13.000 24.600 18.300 18.100 23.900 50.000 13.600 22.900 10.900 18.900 22.400 22.900 44.800 21.700 10.200 15.400
prediction 30.016 27.473 20.03 20.43 19.754 19.652 27.466 19.151 20.272 23.288 29.018 30.391 20.428 20.396 20.479 24.365 12.379 40.837 24.253 14.179 19.939 15.854 24.264 23.91 25.61 9.537 14.58 19.781 43.808 12.196 26.003 19.588 47.472 16.14 23.495 20.884 15.468 33.67 13.158 20.028 ... 13.442 14.908 18.546 32.699 42.099 24.853 21.545 20.184 24.099 6.958 18.581 21.529 19.587 20.225 43.111 24.424 27.855 33.007 17.175 20.587 34.114 11.567 24.242 25.716 15.393 24.697 19.883 17.703 28.866 44.604 16.316 21.185 14.774 20.524 23.963 23.674 42.712 20.786 15.863 15.956

2 rows × 102 columns

Model Score

Check the model score using test data

rfr.score(X_test, y_test)
0.90948626473857
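The value returned by score() is the R² coefficient of determination. As a complementary check, you could also look at error metrics expressed in the target’s own units ($1000’s); a small sketch using sklearn.metrics:

from sklearn import metrics

y_pred = rfr.predict(X_test)
print('MAE= ', metrics.mean_absolute_error(y_test, y_pred))
print('RMSE= ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))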

Note that for the same dataset the decision tree algorithm scored around 66%, while the Random Forest algorithm scores 91%
