Missing Data: Counting Blanks with Python

 


We can easily see that a piece is missing here. 

Last week, my friend and I began brushing up on our Data Analytics skills outside of our work hours, starting off with a Data Cleaning Tutorial on Kaggle.

The first section of this tutorial focused on Handling Missing Values, and although it was somewhat familiar to me, I still learned a lot.

In a previous post I focused on what to do when working with an incomplete dataset.

Today I will be focusing on how to find the number of missing values in a dataset.

Pick your path:

  1. Go on an adventure to find the missing values manually.
  2. Let Python do its thing

Now, while option one can work for small datasets like the example below, option two is useful when working with thousands of rows and columns.

It has been a year since I completed my Computer Science degree. 

Initially, I assumed the process was as straightforward as finding the number of nulls using the .isnull() function and summing them with .sum(). 

However, to arrive at the correct answer, .sum().sum() is actually required.

That line of code looked something like this: 

total_missing = dataframe.isnull().sum().sum()

If you find yourself confused, don't worry—I'll explain:

What does isnull().sum().sum() do?

Think of a DataFrame simply as another term for a table, where we have rows and columns.




The first .sum() examines all the columns in the DataFrame and returns the total count of nulls per column. 





The second .sum() processes the row of values (a Series) and returns the overall total of nulls.





In essence, we use .sum().sum() to calculate the number of blanks in two dimensions: across rows and columns,  a task we humans perform by simply looking at the grid!

Sample Code:

import pandas as pd
import numpy as np

# Creating a DataFrame with null values
data = {
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [np.nan, 10, 11, 12]
}

df = pd.DataFrame(data)

# Counting the number of null values in the DataFrame

total_missing = df.isnull().sum().sum()

print("Number of null values:", total_missing)

# Calculate total number of cells in the DataFrame total_cells = df.size # Calculate the percentage of missing values percent_missing = (total_missing / total_cells) * 100



Resources













Comments

Popular posts from this blog

Missing Data : What to Do?

Prompt Engineering : An Introduction

Upskilling: Certificates vs. Certifications

Women In STEM : Challenges and Advantages

SQL Server Reporting Services vs. Power BI

5 Authentication Methods

There Has Been a Data Breach: Now What?

Inductive and Deductive Reasoning

Improving SQL Query Performance : Indexes

Don't Be Bland : Spice Up Your Personal Brand