One Hot Encoding

One of the most important things in applied machine learning is well-formatted data. We all know how messy real-world data can be, which is why most of the time is spent on data preprocessing. Most machine learning models cannot operate on data that is not in numeric format. That’s where One Hot Encoding comes into the picture. In short, it is a technique used to convert categorical text data into numeric format.

Note that One Hot Encoding is not a silver bullet that will convert any kind of text data in your dataset to numeric format. It’s useful only with categorical data.

What Is Categorical Data

As the name suggests, it is data that can be divided into categories or groups. Examples of categorical variables are sex (male, female, other) and education level (Graduate, Masters, PhD). If the categories don’t have any natural order or ranking among them, the variable is called a Nominal Variable. For example, sex is a nominal categorical variable. On the other hand, if the categories do have a natural order, the variable is called an Ordinal Variable. For example, education level (Graduate, Masters, PhD) is an ordinal categorical variable.
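
The nominal/ordinal distinction can be made explicit in code. The following is a minimal sketch, assuming pandas is available; the variable names `sex` and `education` and the sample values are made up for illustration.

```python
import pandas as pd

# Nominal: the categories have no inherent order.
sex = pd.Categorical(
    ["male", "female", "other", "female"],
    categories=["male", "female", "other"],
    ordered=False,
)

# Ordinal: the categories follow a meaningful order (Graduate < Masters < PhD).
education = pd.Categorical(
    ["Masters", "Graduate", "PhD"],
    categories=["Graduate", "Masters", "PhD"],
    ordered=True,
)

print(sex.ordered)        # False
print(education.ordered)  # True
# Order-based operations such as min() are only defined for ordered categoricals.
print(education.min())    # Graduate
```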

How To Convert Categorical Data To Numeric Data

Now we know that we have to convert categorical data into numeric format so that our model can operate on it. There are two common ways to do this: Label Encoding and One Hot Encoding.

Label Encoding:

  • It is also known as ‘Integer Encoding’ because in this technique we simply assign a number to each category. Numbering starts from 1 and increases by one for each category.

    label_encoding.png
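
Label encoding as described above can be sketched in a few lines of plain Python. The names `label_map` and `samples` are illustrative assumptions, and the category list is assumed to be given in its natural order.

```python
# Categories listed in their natural order.
categories = ["Graduate", "Masters", "PhD"]

# Assign 1, 2, 3, ... to the categories in the order given.
label_map = {category: index for index, category in enumerate(categories, start=1)}

samples = ["Masters", "PhD", "Graduate", "Masters"]
encoded = [label_map[value] for value in samples]

print(label_map)  # {'Graduate': 1, 'Masters': 2, 'PhD': 3}
print(encoded)    # [2, 3, 1, 2]
```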

Issue With Label Encoding

  • Label encoding is only appropriate for ordinal variables, where the categories can be represented by numbers that follow a meaningful order.
  • You also have to make sure to get that order right to avoid prediction errors.
  • Consider the example of the sex categories, where there is no natural order. A machine learning model performs a series of mathematical operations on the given data to establish relationships between input features. If the model calculates the average of the categories ‘male’ and ‘other’, we get (1+3)/2 = 2, which is the same as the label value of ‘female’. This is just one example, but you can imagine what happens to a model when it finds this kind of spurious relationship in the data!
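
The arithmetic in the last point can be checked directly. This small sketch assumes the label values male = 1, female = 2 and other = 3, as implied by the example.

```python
# Integer labels for a nominal variable (assumed values from the example above).
sex_labels = {"male": 1, "female": 2, "other": 3}

# Averaging two unrelated categories lands exactly on a third one.
average = (sex_labels["male"] + sex_labels["other"]) / 2

print(average)                          # 2.0
print(average == sex_labels["female"])  # True
```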

One Hot Encoding

  • In One Hot Encoding we use binary categorization. We create a separate column for each category and assign it the binary value 1 or 0.
  • It is the most commonly used technique for converting categorical data into numeric format.
  • Since we create a separate column with a binary value for each category, it avoids any false correlation between unrelated categories.
  • The extra variables created for each category are called Dummy Variables.

    binary_encoding.png
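
To make the idea concrete before turning to library functions, here is a minimal hand-written sketch in plain Python. The helper `one_hot_encode` is hypothetical and only for illustration; it sorts the categories alphabetically to get a stable column order.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


categories, rows = one_hot_encode(["Bream", "Roach", "Pike", "Roach"])

print(categories)  # ['Bream', 'Pike', 'Roach']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 0, 1]
# [0, 1, 0]
# [0, 0, 1]
```

Each row contains exactly one 1, which is where the name "one hot" comes from.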

Dummy Variable Trap

  • The dummy variable trap occurs when dummy variables are multicollinear with each other. That means the value of one dummy variable can be predicted from the other dummy variables.
  • Remember that a machine learning model performs a series of mathematical operations on the given data to establish relationships between input features. If there is multicollinearity between dummy variables, it can affect model performance, particularly for linear models.
  • The usual way to avoid this is to drop one of the dummy variable columns.
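
One way to see the trap numerically: the dummy columns of a one-hot encoded variable always sum to 1 in every row, which is exactly the intercept column of a linear model. The sketch below, assuming NumPy and a made-up three-category example, compares the rank of the design matrix with and without the last dummy column.

```python
import numpy as np

# One-hot columns for three hypothetical categories (A, B, C), one row per sample.
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])

# Intercept column of ones, as used by a linear model with a bias term.
intercept = np.ones((dummies.shape[0], 1))

full = np.hstack([intercept, dummies])             # 4 columns
reduced = np.hstack([intercept, dummies[:, :-1]])  # 3 columns (last dummy dropped)

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 4 3 -> one column is redundant
print(reduced.shape[1], rank_reduced)  # 3 3 -> all columns are independent
```

With all dummy columns present, the matrix has four columns but only rank 3; after dropping one dummy column, every remaining column carries independent information.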

Python Code

Let’s see how to do One Hot Encoding with the pandas and sklearn libraries on real-world data.

Import the required libraries

  • pandas: Used for data manipulation and analysis. Here we are going to use the ‘get_dummies()’ function for One Hot Encoding.
  • numpy: NumPy is the core library for scientific computing in Python. It is used for working with arrays and matrices.
  • ColumnTransformer: Sklearn ColumnTransformer applies transformers to selected columns of a dataset. Here we use it to apply one-hot encoding to the categorical column.
  • OneHotEncoder: Sklearn OneHotEncoder for one-hot (binary) encoding.
  • linear_model: Sklearn module that contains the linear regression model.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model

Import the dataset

df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/Fish_Weight_One_Hot_Encoding.csv')
print('Dimension of dataset= ', df.shape)
print('Types of species= ', df.Species.unique()) # To get unique values from column
df.sample(5) # Display 5 random training examples
Dimension of dataset=  (42, 4)
Types of species=  ['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']
      Species  Weight   Height   Width
24      Perch     5.9   2.1120  1.4080
0       Bream   242.0  11.5200  4.0200
21     Parkki   120.0   8.3922  2.9181
15  Whitefish   540.0  10.7440  6.5620
22     Parkki   150.0   8.8928  3.2928

Understanding the dataset

  • There are 42 rows (training samples) and 4 columns in the dataset.
  • The columns are as follows:
    • Species: Type of fish (‘Bream’, ‘Roach’, ‘Whitefish’, ‘Parkki’, ‘Perch’, ‘Pike’, ‘Smelt’)
    • Weight: Weight of the fish in grams
    • Height: Height in cm
    • Width: Diagonal width in cm
  • Features/input values/independent variables are ‘Species’, ‘Height’ and ‘Width’.
  • Target/output value/dependent variable is ‘Weight’.

We can use the above data to create a linear model that estimates the weight of a fish from its measurements. But since the Species data is in text format, we either have to drop it or convert it into numeric format.

Fish species is a nominal categorical variable: the species have no natural order, so we can’t use label encoding here. We will use One Hot Encoding to convert the species into numeric format. At the end we will also fit a linear regression to test our dataset.

One Hot Encoding Using Pandas

The pandas get_dummies() function creates a separate column for each category and assigns a binary value to it.

dummies = pd.get_dummies(df.Species)
dummies
Bream Parkki Perch Pike Roach Smelt Whitefish
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0
6 0 0 0 0 1 0 0
7 0 0 0 0 1 0 0
8 0 0 0 0 1 0 0
9 0 0 0 0 1 0 0
10 0 0 0 0 1 0 0
11 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1
13 0 0 0 0 0 0 1
14 0 0 0 0 0 0 1
15 0 0 0 0 0 0 1
16 0 0 0 0 0 0 1
17 0 0 0 0 0 0 1
18 0 1 0 0 0 0 0
19 0 1 0 0 0 0 0
20 0 1 0 0 0 0 0
21 0 1 0 0 0 0 0
22 0 1 0 0 0 0 0
23 0 1 0 0 0 0 0
24 0 0 1 0 0 0 0
25 0 0 1 0 0 0 0
26 0 0 1 0 0 0 0
27 0 0 1 0 0 0 0
28 0 0 1 0 0 0 0
29 0 0 1 0 0 0 0
30 0 0 0 1 0 0 0
31 0 0 0 1 0 0 0
32 0 0 0 1 0 0 0
33 0 0 0 1 0 0 0
34 0 0 0 1 0 0 0
35 0 0 0 1 0 0 0
36 0 0 0 0 0 1 0
37 0 0 0 0 0 1 0
38 0 0 0 0 0 1 0
39 0 0 0 0 0 1 0
40 0 0 0 0 0 1 0
41 0 0 0 0 0 1 0

Note the new dummy variables for each species and their binary values in the output above.

Let’s add the newly created dummy variables to the existing dataset.

# The pandas concat function joins two DataFrames; axis='columns' places them side by side.
df1 = pd.concat([df, dummies], axis='columns')
df1.sample(5)
   Species  Weight   Height   Width  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish
23  Parkki   140.0   8.5376  3.2944      0       1      0     0      0      0          0
0    Bream   242.0  11.5200  4.0200      1       0      0     0      0      0          0
19  Parkki    60.0   6.5772  2.3142      0       1      0     0      0      0          0
34    Pike   430.0   7.2900  4.5765      0       0      0     1      0      0          0
3    Bream   363.0  12.7300  4.4555      1       0      0     0      0      0          0

Since we now have dummy variables for the Species feature, we can drop the ‘Species’ column. To avoid the ‘Dummy Variable Trap’ we will also drop the ‘Whitefish’ column.

df2 = df1.drop(['Species','Whitefish'], axis='columns')
df2.sample(5)
    Weight   Height   Width  Bream  Parkki  Perch  Pike  Roach  Smelt
24     5.9   2.1120  1.4080      0       0      1     0      0      0
8     78.0   5.5756  2.9044      0       0      0     0      1      0
17  1000.0  12.3540  6.5250      0       0      0     0      0      0
23   140.0   8.5376  3.2944      0       1      0     0      0      0
19    60.0   6.5772  2.3142      0       1      0     0      0      0

Now this is our final dataset. We can use this for linear regression.

# Create feature matrix
X = df2.drop(['Weight'],axis = 'columns') 
# Create target vector
y = df2.Weight 

lm = linear_model.LinearRegression()
# Train the model using the training data
lm.fit(X,y)
# Check the model score (R^2)
lm.score(X,y) 
# Note: We shouldn't score the model on the same data it was trained on; a proper train/test split is out of scope for this tutorial.
0.9058731241968216
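
In the pandas walkthrough above, the ‘Whitefish’ column was dropped by hand. As a possible shortcut, get_dummies() also accepts a `drop_first=True` argument that removes one dummy column automatically. The sketch below uses a small made-up Series rather than the fish dataset. Note that it drops the first category in sorted order (‘Bream’ here), not ‘Whitefish’; either choice avoids the trap.

```python
import pandas as pd

# Small illustrative sample, not the full fish dataset.
species = pd.Series(["Bream", "Roach", "Whitefish", "Bream"], name="Species")

# drop_first=True removes the dummy column of the first category.
encoded = pd.get_dummies(species, drop_first=True)

print(list(encoded.columns))  # ['Roach', 'Whitefish']
print(encoded.shape)          # (4, 2)
```

A row with zeros in every remaining column then stands for the dropped category.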

One Hot Encoding Using Sklearn Preprocessing

We will again start with the original dataset ‘df’.

df.head(5)
  Species  Weight   Height   Width
0   Bream   242.0  11.5200  4.0200
1   Bream   290.0  12.4800  4.3056
2   Bream   340.0  12.3778  4.6961
3   Bream   363.0  12.7300  4.4555
4   Bream   450.0  13.6024  4.9274
# Create a ColumnTransformer that one-hot encodes column 0 (Species) and passes the remaining columns through unchanged
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])],  remainder='passthrough')

# fit_transform() combines .fit and .transform: .fit learns the categories in the Species column and .transform applies the encoding.
# dtype=str keeps the values as strings, as in the output below (np.str was an alias for str and was removed in NumPy 1.24).
data = np.array(columnTransformer.fit_transform(df), dtype = str)
# Create the final dataframe from the one-hot encoded species dummy variables and the passthrough columns
df1 = pd.DataFrame(data, columns=['Bream','Parkki','Perch','Pike','Roach','Smelt','Whitefish','Weight','Height','Width'])
df1

Bream Parkki Perch Pike Roach Smelt Whitefish Weight Height Width
0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 242.0 11.52 4.02
1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 290.0 12.48 4.3056
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 340.0 12.3778 4.6961
3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 363.0 12.73 4.4555
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 450.0 13.6024 4.9274
5 1.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0 14.1795 5.2785
6 0.0 0.0 0.0 0.0 1.0 0.0 0.0 40.0 4.1472 2.2680000000000002
7 0.0 0.0 0.0 0.0 1.0 0.0 0.0 69.0 5.2983 2.8217
8 0.0 0.0 0.0 0.0 1.0 0.0 0.0 78.0 5.5756 2.9044
9 0.0 0.0 0.0 0.0 1.0 0.0 0.0 87.0 5.6166 3.1746
10 0.0 0.0 0.0 0.0 1.0 0.0 0.0 120.0 6.216 3.5742
11 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 6.4752 3.3516
12 0.0 0.0 0.0 0.0 0.0 0.0 1.0 270.0 8.3804 4.2476
13 0.0 0.0 0.0 0.0 0.0 0.0 1.0 270.0 8.1454 4.2485
14 0.0 0.0 0.0 0.0 0.0 0.0 1.0 306.0 8.777999999999999 4.6816
15 0.0 0.0 0.0 0.0 0.0 0.0 1.0 540.0 10.744000000000002 6.562
16 0.0 0.0 0.0 0.0 0.0 0.0 1.0 800.0 11.7612 6.5736
17 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1000.0 12.354000000000001 6.525
18 0.0 1.0 0.0 0.0 0.0 0.0 0.0 55.0 6.8475 2.3265
19 0.0 1.0 0.0 0.0 0.0 0.0 0.0 60.0 6.5772 2.3142
20 0.0 1.0 0.0 0.0 0.0 0.0 0.0 90.0 7.4052 2.673
21 0.0 1.0 0.0 0.0 0.0 0.0 0.0 120.0 8.3922 2.9181
22 0.0 1.0 0.0 0.0 0.0 0.0 0.0 150.0 8.8928 3.2928
23 0.0 1.0 0.0 0.0 0.0 0.0 0.0 140.0 8.5376 3.2944
24 0.0 0.0 1.0 0.0 0.0 0.0 0.0 5.9 2.112 1.4080000000000001
25 0.0 0.0 1.0 0.0 0.0 0.0 0.0 32.0 3.528 1.9992
26 0.0 0.0 1.0 0.0 0.0 0.0 0.0 40.0 3.824 2.432
27 0.0 0.0 1.0 0.0 0.0 0.0 0.0 51.5 4.5924 2.6316
28 0.0 0.0 1.0 0.0 0.0 0.0 0.0 70.0 4.588 2.9415
29 0.0 0.0 1.0 0.0 0.0 0.0 0.0 100.0 5.2224 3.3216
30 0.0 0.0 0.0 1.0 0.0 0.0 0.0 200.0 5.568 3.3756
31 0.0 0.0 0.0 1.0 0.0 0.0 0.0 300.0 5.7078 4.158
32 0.0 0.0 0.0 1.0 0.0 0.0 0.0 300.0 5.9364 4.3844
33 0.0 0.0 0.0 1.0 0.0 0.0 0.0 300.0 6.2884 4.0198
34 0.0 0.0 0.0 1.0 0.0 0.0 0.0 430.0 7.29 4.5765
35 0.0 0.0 0.0 1.0 0.0 0.0 0.0 345.0 6.396 3.977
36 0.0 0.0 0.0 0.0 0.0 1.0 0.0 6.7 1.7388 1.0476
37 0.0 0.0 0.0 0.0 0.0 1.0 0.0 7.5 1.972 1.16
38 0.0 0.0 0.0 0.0 0.0 1.0 0.0 7.0 1.7284 1.1484
39 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.7 2.1959999999999997 1.38
40 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.8 2.0832 1.2772
41 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.7 1.9782 1.2852

Note the new dummy variables for each species and their binary values in the output above.

To avoid the ‘Dummy Variable Trap’ we will drop the ‘Whitefish’ column.

df2 = df1.drop(['Whitefish'], axis = 'columns')
df2.sample(10)
   Bream Parkki Perch Pike Roach Smelt  Weight              Height   Width
29   0.0    0.0   1.0  0.0   0.0   0.0   100.0              5.2224  3.3216
8    0.0    0.0   0.0  0.0   1.0   0.0    78.0              5.5756  2.9044
4    1.0    0.0   0.0  0.0   0.0   0.0   450.0             13.6024  4.9274
7    0.0    0.0   0.0  0.0   1.0   0.0    69.0              5.2983  2.8217
3    1.0    0.0   0.0  0.0   0.0   0.0   363.0               12.73  4.4555
10   0.0    0.0   0.0  0.0   1.0   0.0   120.0               6.216  3.5742
34   0.0    0.0   0.0  1.0   0.0   0.0   430.0                7.29  4.5765
18   0.0    1.0   0.0  0.0   0.0   0.0    55.0              6.8475  2.3265
15   0.0    0.0   0.0  0.0   0.0   0.0   540.0  10.744000000000002   6.562
17   0.0    0.0   0.0  0.0   0.0   0.0  1000.0  12.354000000000001   6.525

Now our final dataset is ready and we can perform the linear regression.

# Create feature matrix
X = df2.drop(['Weight'],axis = 'columns') 
# Create target vector
y = df2.Weight 

lm = linear_model.LinearRegression()
# Train the model
lm.fit(X,y)
lm.score(X,y) 
# Note: We shouldn't score the model on the same data it was trained on; a proper train/test split is out of scope for this tutorial.
0.9058731241968216

This is how we can use the pandas and sklearn libraries to perform One Hot Encoding.
