Missing Data: Counting Blanks with Python
Last week, my friend and I began brushing up on our Data Analytics skills outside of our work hours, starting off with a Data Cleaning Tutorial on Kaggle.
The first section of this tutorial focused on Handling Missing Values, and although it was somewhat familiar to me, I still learned a lot.
In a previous post I focused on what to do when working with an incomplete dataset.
Today I will be focusing on how to find the number of missing values in a dataset.
Pick your path:
- Go on an adventure to find the missing values manually.
- Let Python do its thing.
It has been a year since I completed my Computer Science degree.
Initially, I assumed the process was as straightforward as flagging the nulls with the .isnull() method and summing them with .sum().
However, a single .sum() only gives a count for each column; to get one total for the whole dataset, .sum().sum() is required.
That line of code looked something like this:
total_missing = dataframe.isnull().sum().sum()
If you find yourself confused, don't worry—I'll explain:
What does isnull().sum().sum() do?
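Think of a DataFrame as just another term for a table, with rows and columns.
Calling .isnull() gives back a table of the same shape, with True wherever a value is missing and False everywhere else.
The first .sum() adds up each column (True counts as 1), so we get one count per column.
The second .sum() adds those column counts together, which leaves a single number: the total of missing values in the whole table.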
import pandas as pd
import numpy as np

# Creating a DataFrame with null values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [np.nan, 10, 11, 12]
}
df = pd.DataFrame(data)
# Counting the number of null values in the DataFrame
total_missing = df.isnull().sum().sum()
print("Number of null values:", total_missing)
# Calculate total number of cells in the DataFrame
total_cells = df.size
# Calculate the percentage of missing values
percent_missing = (total_missing / total_cells) * 100
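If the chained call still feels like a black box, it can help to pull it apart and look at each step on its own. Here is a minimal sketch using the same toy table as above; the names mask, missing_per_column and percent_per_column are my own for illustration and are not from the Kaggle tutorial.

```python
import numpy as np
import pandas as pd

# Same toy table as in the example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [np.nan, 10, 11, 12],
})

# Step 1: a True/False table with the same shape as df
mask = df.isnull()

# Step 2: one count per column (True counts as 1)
missing_per_column = mask.sum()

# Step 3: add the column counts into a single number
total_missing = missing_per_column.sum()

# Percentage of missing cells, overall and per column
percent_missing = total_missing / df.size * 100
percent_per_column = mask.mean() * 100

print(missing_per_column)
print("Total missing:", total_missing)
print("Percent missing:", percent_missing)
print(percent_per_column)
```

Each column here has exactly one blank, so the per-column counts are all 1, the total is 3, and 3 out of 12 cells works out to 25% missing. On a real dataset the per-column view is often the more useful one, since it shows where the blanks are concentrated.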