Data Preprocessing: One-Hot Encoding
A single binary digit is called a "bit"
Before any data analysis can begin, the first step is usually data preprocessing.
Data cleaning is an essential phase in data analysis because real-world datasets tend to be incomplete and messy.
Think of it as washing and peeling potatoes before putting them on to boil.
This process is crucial because the insights and analysis are only as good as the data you are using.
Without proper input: Garbage In, Garbage Out.
I first came across one-hot encoding during a lab session for COMP 3610: Big Data Analytics, in my final year at UWI.
One-hot encoding transforms qualitative (categorical) data into a binary format.
"Why is it called one-hot encoding?"
One-hot encoding gets its name because it represents categorical data in a binary format where, for each data point, only one bit is "hot" or active (set to 1).
Let's consider an example of one-hot encoding the cap colours of mushrooms:
Original Dataset
One-Hot Encoded Dataset
- For each category, a new binary column is created.
- If the data point belongs to a specific category, the corresponding binary column is set to 1 (hot).
- All other binary columns for other categories are set to 0 (not hot).
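The three steps above can be sketched in plain Python. This is only a minimal illustration: the `one_hot` helper and the sample cap colours are made up for this example and are not taken from the original dataset.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category names and one row of 0/1 flags per value.
    """
    # One binary column per distinct category (sorted for a stable order).
    categories = sorted(set(values))
    rows = []
    for value in values:
        # 1 in the column matching this value's category, 0 everywhere else.
        rows.append([1 if value == category else 0 for category in categories])
    return categories, rows


# Hypothetical sample data, not taken from the original dataset.
cap_colours = ["brown", "red", "white", "brown"]
columns, encoded = one_hot(cap_colours)

print(columns)
for colour, row in zip(cap_colours, encoded):
    print(colour, row)
```

Every row sums to exactly 1, which is the "one hot" property. In practice you would rarely write this by hand; libraries such as pandas offer it directly (for example, `pandas.get_dummies`).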
That said, here are some points to consider before using this technique:
Dimensionality Increase
One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables having many unique categories. In the example above, we went from one column holding information on cap colour to having three columns.
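To get a feel for how quickly the column count grows, here is a rough sketch that counts columns before and after encoding. The `encoded_width` helper and the toy table are hypothetical and only serve to illustrate the arithmetic: each categorical column expands into one binary column per distinct category.

```python
def encoded_width(table):
    """Count columns before and after one-hot encoding every column of `table`.

    `table` maps a column name to its list of categorical values.
    """
    before = len(table)
    # Each column expands into one binary column per distinct category.
    after = sum(len(set(values)) for values in table.values())
    return before, after


# Hypothetical toy table; the values are illustrative only.
mushrooms = {
    "cap_colour": ["brown", "red", "white", "brown"],
    "odour": ["almond", "none", "foul", "anise"],
}
before, after = encoded_width(mushrooms)
print(f"{before} columns become {after} after one-hot encoding")
```

Running a count like this before encoding is a cheap way to spot columns with so many categories that the encoded dataset would become unwieldy.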
Sparse Matrices
One-hot encoded data forms sparse matrices: mostly zeros with only a few ones. Stored as ordinary dense arrays, these matrices can require more memory and processing time during training, especially with large datasets.
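As a rough illustration of that sparsity, the sketch below keeps only the positions of the ones instead of every cell. The `to_sparse` helper and the sample rows are assumptions made for this example. Dedicated sparse matrix types, such as those in SciPy, build on the same idea.

```python
def to_sparse(rows):
    """Keep only the (row, column) positions that hold a 1."""
    return [(r, c) for r, row in enumerate(rows) for c, bit in enumerate(row) if bit]


# Hypothetical one-hot rows for a feature with five categories.
dense = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
sparse = to_sparse(dense)

# Share of cells in the dense layout that carry no information (zeros).
total_cells = sum(len(row) for row in dense)
zero_share = 1 - len(sparse) / total_cells
print(sparse)
print(f"{zero_share:.0%} of the dense cells are zeros")
```

With k categories, each row holds a single 1 and k - 1 zeros, so the share of zeros grows as the number of categories grows.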
Irrelevant Categories
Suppose our model had to predict whether a mushroom is poisonous or not (its class) from over 20 features of that mushroom. It is reasonable to assume that not every feature correlates strongly with the class, as depicted in the heatmap below. One-hot encoding such weakly related features may add columns without adding much predictive value.
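One rough way to check how strongly an encoded column relates to the class is to compute a correlation coefficient between the two. The sketch below uses Pearson correlation on made-up toy values. The column names and numbers are hypothetical, and this is not necessarily how the heatmap was produced.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values: 1 = poisonous, 0 = edible.
poisonous = [1, 1, 0, 0, 1, 0]
# Two hypothetical one-hot columns.
odour_foul = [1, 1, 0, 0, 1, 0]        # lines up with the class in this toy data
cap_colour_brown = [1, 0, 1, 0, 1, 0]  # only loosely related to the class

strong = pearson(odour_foul, poisonous)
weak = pearson(cap_colour_brown, poisonous)
print(f"odour_foul: {strong:.2f}, cap_colour_brown: {weak:.2f}")
```

Columns whose coefficient stays close to zero are candidates to leave out before encoding, though a low linear correlation alone does not prove that a feature is useless.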