Data Wrangling: Best Practices For Working With Big Datasets
Let's face it, working with exponentially expanding datasets can be both exciting and overwhelming.
Imagine dealing with a dataset of 22 million rows - that's a lot of information to process!
The question is, can your ETL process handle it?
This is exactly the problem I faced this week: I had to find a way to improve the performance of the process that updates a dashboard, because it was taking what felt like a decade to finish.
The first question that crossed my mind was, "How big is this dataset, actually?"
At first, I mistakenly assumed that the dataset contained between 1 and 6 million rows.
A quick COUNT(*) query made it clear that my estimate was way off, and it also put the problem into perspective.
That dataset was a behemoth; it probably had its own gravitational pull!
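For the record, the sanity check was nothing fancy. It was along these lines, with the table name swapped for a placeholder:

-- Quick row count to see what we are actually dealing with
-- (fact_sales is a stand-in for the real table name)
SELECT COUNT(*) AS total_rows
FROM fact_sales;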
The sad truth was that the process was not scalable, and it was clear that immediate improvements were necessary.
Here are three tips that I have learned for dealing with such massive datasets:
1) Optimize Your Code
If you're using SQL scripts, like I was, consider adding indexes, selecting only the fields you actually need, and wrapping repeated logic in stored procedures to improve query performance.
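The snippet below is only an illustration, not the exact script from my process. The table and column names (fact_sales, order_date, region, sale_amount) are made up, and the stored procedure syntax is PostgreSQL-flavoured, so adjust it for your own engine:

-- Index the columns the dashboard query filters and joins on
CREATE INDEX idx_fact_sales_order_date ON fact_sales (order_date);

-- Pull only the fields the dashboard needs, instead of SELECT * over millions of rows
SELECT order_date,
       region,
       SUM(sale_amount) AS total_sales
FROM fact_sales
WHERE order_date >= DATE '2023-01-01'
GROUP BY order_date, region;

-- Wrap the refresh logic in a stored procedure so it can be scheduled and reused
CREATE PROCEDURE refresh_sales_summary()
LANGUAGE sql
AS $$
    INSERT INTO sales_summary (order_date, region, total_sales)
    SELECT order_date, region, SUM(sale_amount)
    FROM fact_sales
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY order_date, region;
$$;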
2) Employ Data Partitioning
Partition your data into smaller, more manageable chunks. This can help reduce the processing load on individual nodes and improve performance.
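As a minimal sketch, here is what monthly range partitioning could look like using PostgreSQL-style declarative partitioning (again, the table and column names are placeholders, and other engines use different syntax):

-- Split the big fact table into monthly partitions on order_date
CREATE TABLE fact_sales_partitioned (
    sale_id     BIGINT,
    order_date  DATE NOT NULL,
    region      TEXT,
    sale_amount NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE fact_sales_2023_01 PARTITION OF fact_sales_partitioned
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

CREATE TABLE fact_sales_2023_02 PARTITION OF fact_sales_partitioned
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

-- A query that filters on order_date only scans the partitions it needs
SELECT region, SUM(sale_amount) AS total_sales
FROM fact_sales_partitioned
WHERE order_date >= DATE '2023-02-01' AND order_date < DATE '2023-03-01'
GROUP BY region;

The idea is that a dashboard refresh that only cares about the latest month no longer has to scan all 22 million rows.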
3) Use Cloud-Based Services
Offloading the heavy lifting to cloud-based services not only reduces the burden on your own hardware, but also ensures that you have the computing power you need to get the job done.
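Purely as an illustration (Snowflake is just one example of such a service, and the warehouse name here is made up), scaling compute up for a heavy refresh and back down afterwards can be a single statement each way:

-- Temporarily scale the virtual warehouse up for the heavy refresh...
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ...run the ETL / dashboard refresh here...

-- ...then scale back down to keep costs in check
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';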