Data Preprocessing : One-Hot Encoding

 


A single binary digit is called a "bit"


Before any data analysis can begin, the data usually has to be preprocessed.

Data cleaning is an essential phase in data analysis because real-world datasets tend to be incomplete and messy.

Think of it as washing and peeling potatoes before putting them on to boil.

This process is crucial because the insights and analysis are only as good as the data you are using.

Without proper input: Garbage In, Garbage Out.

I first came across one-hot encoding in my final year at UWI in COMP 3610: Big Data Analytics in a lab session.

One-hot encoding transforms qualitative (categorical) data into a binary format.

"Why is it called one-hot encoding?"

The name comes from how each data point is represented: as a row of binary values in which exactly one bit is "hot", or active (set to 1), while all the others are 0.


Let's consider an example of one-hot encoding the cap colours of mushrooms:

Original Dataset

Mushroom      Cap Colour
Mushroom 1    Red
Mushroom 2    Brown
Mushroom 3    White


One-Hot Encoded Dataset

Mushroom      Red    Brown    White
Mushroom 1    🔥     0        0
Mushroom 2    0      🔥       0
Mushroom 3    0      0        🔥


🔥: This is actually a '1'

  • For each category, a new binary column is created.
  • If the data point belongs to a specific category, the corresponding binary column is set to 1 (hot).
  • All other binary columns for other categories are set to 0 (not hot).
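
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not a production implementation: the function name one_hot_encode is made up for this post, and the data is the cap colour column from the table above.

```python
# Cap colours of Mushroom 1, 2 and 3 from the example table.
cap_colours = ["Red", "Brown", "White"]


def one_hot_encode(values):
    """Return (categories, rows) with one binary column per distinct category.

    Hypothetical helper written for illustration only.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    categories = list(dict.fromkeys(values))
    # For each data point: 1 in the column of its own category, 0 elsewhere.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


categories, encoded = one_hot_encode(cap_colours)
print(categories)
for row in encoded:
    print(row)
```

In practice you would normally reach for a library routine such as pandas.get_dummies or scikit-learn's OneHotEncoder instead of writing this by hand.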

That said, here are some points to consider before using this technique:

Dimensionality Increase 

One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables having many unique categories. In the example above, we went from one column holding information on cap colour to having three columns. 
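
To get a feel for how quickly the column count grows, here is a rough sketch that counts the binary columns one-hot encoding would produce. The helper encoded_width and the "Specimen ID" column are assumptions made up for illustration; only the cap colour column comes from the example above.

```python
def encoded_width(columns):
    """Count the binary columns produced by one-hot encoding every column.

    Each categorical column contributes one binary column per distinct value.
    """
    return sum(len(set(values)) for values in columns.values())


# The cap colour column from the example: 1 column becomes 3.
mushrooms = {"Cap Colour": ["Red", "Brown", "White"]}
print(len(mushrooms), "->", encoded_width(mushrooms))

# Hypothetical high-cardinality column: 1,000 distinct IDs become 1,000 columns.
specimens = {"Specimen ID": [f"M{i}" for i in range(1000)]}
print(len(specimens), "->", encoded_width(specimens))
```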

Sparse Matrices

One-hot encoded data forms sparse matrices: mostly zeros with only a few ones. Stored as ordinary dense arrays, these matrices can require more memory and processing time during training, especially when dealing with large datasets.
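
The sketch below shows why the zeros are wasteful: every row of a one-hot matrix holds a single 1, so remembering the position of that 1 is enough to rebuild the whole row. The colour list is invented for illustration, and this is a simplified picture of the idea behind sparse storage, not how any particular library implements it.

```python
# Made-up sample of cap colours (illustration only).
colours = ["Red", "Brown", "White", "Brown", "Red"]

categories = sorted(set(colours))
index = {category: i for i, category in enumerate(categories)}

# Dense form: every cell is stored, including all the zeros.
dense = [
    [1 if index[colour] == j else 0 for j in range(len(categories))]
    for colour in colours
]

# Sparse idea: store only the position of the single "hot" column per row.
hot_positions = [index[colour] for colour in colours]

total_cells = len(dense) * len(categories)
nonzero_cells = sum(sum(row) for row in dense)
print(f"{nonzero_cells} of {total_cells} cells are non-zero")

# The dense matrix can be rebuilt from the hot positions alone.
rebuilt = [
    [1 if j == position else 0 for j in range(len(categories))]
    for position in hot_positions
]
```

Libraries such as SciPy provide ready-made sparse matrix formats built on the same principle.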

Irrelevant Categories

Suppose our model had to predict whether a mushroom is poisonous or not (its class) based on over 20 features specific to that mushroom. It is reasonable to assume that not every feature will correlate strongly with the class, as depicted in the heatmap below.
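
As a rough sketch of how such a relationship can be measured, the snippet below computes the Pearson correlation between two one-hot columns and a binary class label. All of the values are invented purely for illustration and are not taken from the mushroom dataset; the pearson helper is likewise written just for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented toy data: 1 = poisonous, 0 = edible.
poisonous = [1, 1, 0, 0, 1, 0]
# Invented one-hot columns for two cap colours.
cap_red = [1, 1, 0, 0, 1, 0]    # lines up perfectly with the class
cap_white = [0, 1, 0, 1, 0, 1]  # only weakly related to the class

r_red = pearson(cap_red, poisonous)
r_white = pearson(cap_white, poisonous)
print(f"red: {r_red:.2f}, white: {r_white:.2f}")
```

A column whose correlation with the class sits close to zero is a candidate for closer inspection before it is kept as a feature.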

