Missing Data : What to Do?

 


A missing tooth: that's not good (or is it?).

Today's topic is inspired by a problem that one of my coworkers brought to my attention, and I think this is also something many people face on a day-to-day basis.

As data professionals, we have all encountered datasets that contain fields with no data, which can be frustrating when building reports or machine learning models.

Since we all know that a perfect dataset does not exist in the real world, the next best thing to do would be to have a game plan in place to tackle the inevitable.

When posed with the question, "Toni, how should I go about solving this?" the first word that comes to mind is context.

You see, just because something is missing doesn't mean it needs to be "fixed." For example, knowing why a tooth is missing is important before acting.

A 6-year-old with a missing tooth doesn't require a filling.

With that being said, here is a strategy for dealing with the unknown:

Why is the data not there?

Take client information as an example. It might be missing for two main reasons:

The field is not mandatory: Some information may be missing because users are not required to fill in all the details, and they might skip some fields.

Imported and not regularly updated: Data might also be absent because it comes from another system and isn't consistently updated.

To get a clearer picture, it's a good idea to speak with the people who manage this process to understand the exact reasons. Having this information in advance will help you respond to user inquiries more effectively.

What percentage of the data is not there?

For small percentages of missing data, you can use data imputation techniques to fill in the gaps, such as replacing blanks with the field's mean, median, or most frequent value.

If a particular variable has a very high percentage of missing data, consider excluding that variable from your analysis. Filling in most of a field from a handful of real values tends to introduce bias or noise into your results.
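To make the "how much is missing?" question concrete, here is a minimal Python sketch. It is only an illustration under assumptions of my own: the sample records, the function names, and the 50% cutoff are all hypothetical, so pick a threshold that fits your own context. It measures the share of missing values per field, drops fields that are mostly empty, and fills the rest with the median (numbers) or the most frequent value (everything else).

```python
from collections import Counter
from statistics import median

# Hypothetical cutoff: the right number depends on your data and use case.
DROP_THRESHOLD = 0.5


def missing_share(rows, field):
    """Return the fraction of rows where `field` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def fill_or_drop(rows, fields, threshold=DROP_THRESHOLD):
    """Drop fields that are mostly empty and impute the others.

    Numeric fields get the median of the observed values;
    other fields get the most frequent observed value.
    """
    cleaned = [dict(row) for row in rows]  # copy so the input stays untouched
    for field in fields:
        observed = [row[field] for row in rows if row.get(field) is not None]
        if missing_share(rows, field) > threshold or not observed:
            # Too little real data to estimate from: remove the field.
            for row in cleaned:
                row.pop(field, None)
            continue
        if all(isinstance(value, (int, float)) for value in observed):
            fill = median(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for row in cleaned:
            if row.get(field) is None:
                row[field] = fill
    return cleaned


# Made-up sample records, for illustration only.
clients = [
    {"name": "Ana", "age": 34, "fax": None},
    {"name": "Ben", "age": None, "fax": None},
    {"name": "Cy", "age": 40, "fax": None},
    {"name": "Dee", "age": 29, "fax": "555-0100"},
]
cleaned_clients = fill_or_drop(clients, ["age", "fax"])
```

In this sample, "age" is missing in one of four records, so it is filled in, while "fax" is missing in three of four, so the field is dropped. Median and mode are just the simplest choices here; whether they are appropriate goes back to the first question, why the data is missing.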

What if the primary key is missing?

Reconstruct if Possible: Look for clues in the data to recreate the missing primary key. For instance, if you're missing an order number, you might use a combination of other details like customer name, date, and product to create a temporary key.

Use an Artificial Key: In some cases, you can create a new, unique identifier to serve as the primary key. This artificial key can be a sequential number or a randomly generated code.
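Both options can be sketched in a few lines of Python. Treat this as a rough illustration, not a finished solution: the field names (order_id, customer, date, product), the "TMP-" and "GEN-" prefixes, and the function names are all assumptions I made up for the example. When the clues are available, it builds a temporary key from them; otherwise it falls back to an artificial key, either sequential or random.

```python
import hashlib
import uuid
from itertools import count

# Hypothetical field names used as clues for rebuilding the key.
CLUE_FIELDS = ("customer", "date", "product")


def reconstructed_key(order):
    """Build a temporary key from customer, date, and product.

    The same three values always give the same key, so two rows that share
    all three will collide and should be reviewed by hand.
    """
    parts = (order[field] for field in CLUE_FIELDS)
    normalized = "|".join(str(part).strip().lower() for part in parts)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "TMP-" + digest[:12]


def fill_missing_keys(orders, random_ids=False):
    """Return copies of the orders with every missing order_id filled in."""
    sequence = count(1)
    result = []
    for order in orders:
        order = dict(order)  # copy so the input stays untouched
        if order.get("order_id") is None:
            if all(order.get(field) for field in CLUE_FIELDS):
                # Option 1: reconstruct from the clues in the row.
                order["order_id"] = reconstructed_key(order)
            elif random_ids:
                # Option 2a: randomly generated artificial key.
                order["order_id"] = "GEN-" + uuid.uuid4().hex
            else:
                # Option 2b: sequential artificial key.
                order["order_id"] = f"GEN-{next(sequence):06d}"
        result.append(order)
    return result


# Made-up sample orders, for illustration only.
orders = [
    {"order_id": "A-1001", "customer": "Ana", "date": "2024-03-01", "product": "Widget"},
    {"order_id": None, "customer": "Ben", "date": "2024-03-02", "product": "Gadget"},
    {"order_id": None, "customer": None, "date": "2024-03-02", "product": "Gadget"},
]
keyed_orders = fill_missing_keys(orders)
```

The prefixes make it obvious later which keys came from the source system and which ones you generated, which helps when someone asks where an identifier came from.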

