Missing Data: Counting Blanks with Python

- December 03, 2023

We can easily see that a piece is missing here.

Last week, my friend and I began brushing up on our Data Analytics skills outside of our work hours, starting off with a Data Cleaning Tutorial on Kaggle.

The first section of this tutorial focused on Handling Missing Values, and although it was somewhat familiar to me, I still learned a lot.

In a previous post I focused on what to do when working with an incomplete dataset.

Today I will be focusing on how to find the number of missing values in a dataset.

Pick your path:

1. Go on an adventure to find the missing values manually.
2. Let Python do its thing.

Now, while option one can work for small datasets like the example below, option two is useful when working with thousands of rows and columns.

It has been a year since I completed my Computer Science degree.

Initially, I assumed the process was as straightforward as flagging the nulls with the .isnull() method and summing them with .sum().

However, a single .sum() only returns one count per column. To arrive at a single total for the whole DataFrame, .sum().sum() is actually required.

That line of code looked something like this:

total_missing = dataframe.isnull().sum().sum()

If you find yourself confused, don't worry, I'll explain:

What does isnull().sum().sum() do?

Think of a DataFrame simply as another term for a table, where we have rows and columns.

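Before any summing happens, .isnull() returns a table of the same shape as the DataFrame, filled with True where a value is missing and False where it is present.

The first .sum() goes down each column of that table, counting every True as 1, and returns the total count of nulls per column.

The second .sum() adds up that list of per-column counts (a Series) and returns the overall total of nulls.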

In essence, we use .sum().sum() to calculate the number of blanks in two dimensions: across rows and columns, a task we humans perform by simply looking at the grid!
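To make the two steps easier to follow, here is a minimal sketch that stores each intermediate result in its own variable. The names mask and nulls_per_column are just illustrative choices of mine, and the small table is the same one used in the sample code in this post.

```python
import pandas as pd
import numpy as np

# Small example table with one missing value in each column
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [np.nan, 10, 11, 12],
})

# Step 1: a True/False table, True wherever a cell is missing
mask = df.isnull()

# Step 2: the first .sum() adds up each column (True counts as 1),
# which gives a Series with one count per column
nulls_per_column = mask.sum()

# Step 3: the second .sum() adds up that Series into a single number
total_missing = nulls_per_column.sum()

print(mask)
print(nulls_per_column)
print("Total number of null values:", total_missing)
```

Printing nulls_per_column on its own is often useful in practice, since it shows which columns the blanks are hiding in.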

Sample Code:


import pandas as pd
import numpy as np

# Creating a DataFrame with null values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [np.nan, 10, 11, 12]
}

df = pd.DataFrame(data)

# Counting the number of null values in the DataFrame

total_missing = df.isnull().sum().sum()

print("Number of null values:", total_missing)

# Calculate total number of cells in the DataFrame (rows x columns)
total_cells = df.size

# Calculate the percentage of missing values
percent_missing = (total_missing / total_cells) * 100

print("Percentage of missing values:", percent_missing)
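If a per-column breakdown is more useful than one overall figure, the same idea can be sketched with .mean() instead of .sum(). This is not part of the Kaggle exercise, just one possible approach: because True counts as 1 and False as 0, the mean of each column in the True/False table is the fraction of missing values in that column. The variable names below are my own.

```python
import pandas as pd
import numpy as np

# Same example table as in the sample code
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [np.nan, 10, 11, 12],
})

# Fraction of missing values per column, converted to a percentage
percent_missing_per_column = df.isnull().mean() * 100

# Overall percentage: average over every cell of the True/False table
overall_percent_missing = df.isnull().to_numpy().mean() * 100

print(percent_missing_per_column)
print("Overall percentage of missing values:", overall_percent_missing)
```

For this table, each column has one blank out of four values, so every column (and the table as a whole) comes out at 25 percent.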



Resources
Pandas API : DataFrame

Pandas API : Series

Kaggle Exercise

Blog 38 : Missing Data : What to Do?

2D Array : All You Need to Know

Tech Talk with Toni