Data Preprocessing: One-Hot Encoding

August 19, 2023

A single binary digit is called a "bit"

Before any data analysis, the first step is typically data preprocessing.

Data cleaning is an essential phase in data analysis because real-world datasets tend to be incomplete and messy.

Think of it as washing and peeling potatoes before putting them on to boil.

This process is crucial because the insights and analysis are only as good as the data you are using.

Without proper input: Garbage In, Garbage Out.

I first came across one-hot encoding during a lab session for COMP 3610: Big Data Analytics, in my final year at UWI.

One-hot encoding transforms qualitative (categorical) data into a binary format.

"Why is it called one-hot encoding?"

One-hot encoding is named as such because it represents categorical data using a binary format where only one bit is "hot" or active (set to 1) at a time.

Let's consider an example of one-hot encoding the cap colours of mushrooms:

Original Dataset

Mushroom	Cap Colour
Mushroom 1	Red
Mushroom 2	Brown
Mushroom 3	White

One-Hot Encoded Dataset

Mushroom	Red	Brown	White
Mushroom 1	🔥	0	0
Mushroom 2	0	🔥	0
Mushroom 3	0	0	🔥

🔥: This is actually a '1'

For each category, a new binary column is created.
If the data point belongs to a specific category, the corresponding binary column is set to 1 (hot).
All other binary columns for other categories are set to 0 (not hot).
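
The three rules above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name one_hot_encode is made up for this post, and in practice a library routine such as pandas.get_dummies would usually do this job.

```python
def one_hot_encode(values):
    """Return (categories, rows) for a list of categorical values."""
    # Keep categories in order of first appearance so the columns
    # line up with the table: Red, Brown, White.
    categories = list(dict.fromkeys(values))
    # One binary column per category: 1 (hot) if the value belongs
    # to that category, 0 (not hot) otherwise.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


cap_colours = ["Red", "Brown", "White"]  # the three mushrooms from the table
categories, encoded = one_hot_encode(cap_colours)

print(categories)
for row in encoded:
    print(row)
```

Each printed row contains exactly one 1, which is the "one hot" bit for that mushroom.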

With that said, here are some points to consider before using this technique:

Dimensionality Increase

One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables having many unique categories. In the example above, we went from one column holding information on cap colour to having three columns.
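
To make the growth concrete, here is a small sketch that counts columns before and after encoding. The helper encoded_width and the habitat_code feature are hypothetical, invented only for illustration; the point is that each categorical column contributes one new column per unique value.

```python
def encoded_width(columns):
    """Number of columns after one-hot encoding every categorical column."""
    # Each column expands into one binary column per unique category.
    return sum(len(set(values)) for values in columns.values())


# The cap colour example from above: one column with three categories.
mushrooms = {"cap_colour": ["Red", "Brown", "White"]}
print(len(mushrooms), "->", encoded_width(mushrooms))  # 1 -> 3

# A hypothetical second feature with 50 distinct values.
wider = dict(mushrooms, habitat_code=[f"H{i}" for i in range(50)])
print(len(wider), "->", encoded_width(wider))  # 2 -> 53
```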

Sparse Matrices

The resulting one-hot encoded data can lead to sparse matrices, with mostly zero values and a few ones. Stored as ordinary dense arrays, this data can require more memory and processing time during training, especially with large datasets.
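
A rough sketch of how sparse the result gets, using made-up sizes (1,000 rows and 50 categories are assumptions, not figures from the mushroom dataset). It also shows the idea behind sparse storage: since each row has a single 1, remembering only the position of that 1 is enough to rebuild the row. Libraries such as SciPy offer proper sparse matrix formats built on the same idea.

```python
def to_dense(hot_indices, n_categories):
    """Expand a list of hot-column positions into full one-hot rows."""
    return [
        [1 if column == hot else 0 for column in range(n_categories)]
        for hot in hot_indices
    ]


# Hypothetical sizes, chosen only for illustration.
n_rows, n_categories = 1000, 50

# Compact form: one integer per row (the index of the hot column).
hot_indices = [row % n_categories for row in range(n_rows)]

# Dense form: every 0 and 1 is stored explicitly.
dense = to_dense(hot_indices, n_categories)

dense_cells = n_rows * n_categories
nonzero_cells = sum(sum(row) for row in dense)
sparsity = 1 - nonzero_cells / dense_cells

print(dense_cells)        # 50000 values stored in the dense form
print(len(hot_indices))   # 1000 values stored in the compact form
print(f"{sparsity:.0%}")  # 98% of the dense cells are zeros
```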

Irrelevant Categories

Suppose our model must predict whether a mushroom is poisonous or not (its class) from over 20 features specific to that mushroom. It is reasonable to assume that not every feature correlates strongly with the class, as depicted in the heatmap below. One-hot encoding a weakly correlated feature adds columns without adding much predictive value.

Resources

Kaggle Notebook: One Hot Encoding: Everything You Need To Know

StatQuest: One-Hot Encoding Video

Neptune.Ai: Article on More Preprocessing Techniques

Mushroom Dataset

Mycologist Answers Mushroom Questions From Twitter
