Validating Email Address Format with Python's Great Expectations
This week, I had the task of gathering a client contact list, but I ran into some issues with certain email addresses.
Specifically, I noticed that some email addresses were missing the '@' symbol, while others were lacking the domain name.
To ensure the integrity of our client contact list, every email address needs to follow the expected format: username@domain.com. That way, when we use this data for its intended purpose – sending customers promotions, updates, or other important communications – we minimize the number of undeliverable emails.
I had already taken a first pass at the problem by using Alteryx to remove the rows that don't adhere to this format.
Today, I'm excited to show you how we can use Great Expectations on Google Colaboratory to identify erroneous emails.
I want to give a special shout-out to our Product Manager, Ikechi Griffith, for challenging me to dive deeper into Great Expectations.
I started my day bright and early, as usual, and created a sample dataset with some problematic entries:
| ID | Name | Age | Salary | Gender | Registration_Date | Email Address |
|---|---|---|---|---|---|---|
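The rows of the sample table didn't survive formatting here, so as a rough stand-in, this is a sketch of how a file like mine could be put together with pandas. The valid rows below are invented purely for illustration, and only the two malformed addresses are taken from the validation output shown further down (the real dataset had 10 rows):

import pandas as pd

# Illustrative stand-in for the sample dataset; only the two malformed
# email addresses are taken from the actual validation output below.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John Doe", "Jane Smith", "Emily Wilson"],
    "Age": [34, 29, 41],
    "Salary": [55000, 62000, 71000],
    "Gender": ["M", "F", "F"],
    "Registration_Date": ["2023-01-15", "2023-02-03", "2023-03-22"],
    "Email Address": [
        "john.doe123example.com",    # missing '@'
        "jane.smith@example.com",    # valid
        "emily.wilson101@hotmail",   # missing top-level domain
    ],
})
sample.to_csv("/content/non_conforming_dataset.csv", index=False)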
I have not used this library before, so I had to familiarize myself with the documentation and watch a couple of videos, which I will link below.
With that said, here are the steps I took to determine which emails in the dataset are structurally invalid.
Install Great Expectations
First, you need to install the Great Expectations library in your Google Colab environment.
!pip install great-expectations
Instantiate Data Context + Read CSV
Instantiating a data context is a crucial first step when working with Great Expectations, as it provides the foundation for managing and organizing your data validation and testing processes.
import great_expectations as gx
context = gx.get_context()
df = gx.read_csv("/content/non_conforming_dataset.csv")
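Before writing any expectations, it doesn't hurt to confirm that the CSV loaded and the column names came through as expected. This is an optional sanity check, assuming the object gx.read_csv returns behaves like a pandas DataFrame:

# Optional sanity check: the object returned by gx.read_csv supports the
# usual pandas-style inspection methods.
print(type(df))               # Great Expectations' pandas DataFrame wrapper
print(df.columns.tolist())    # should include 'Email Address'
df.head()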
Create Expectation
In Great Expectations, you can create expectations using the expect_* methods, where '*' represents the type of expectation you want to set. In this case, I used expect_column_values_to_match_regex, as it's an easy way to check that each email address matches the expected format.
email_regex = r'^[\w\.-]+@[\w\.-]+\.\w+$'
df.expect_column_values_to_match_regex('Email Address', email_regex)
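If you want to unpack the regex itself, here is the same pattern spelled out with re.VERBOSE comments, plus a couple of quick checks against addresses like the ones in the dataset (plain Python, independent of Great Expectations):

import re

# Same pattern as above, rewritten with re.VERBOSE for readability.
email_pattern = re.compile(r"""
    ^[\w\.-]+     # local part: letters, digits, underscores, dots, hyphens
    @             # literal '@' separator
    [\w\.-]+      # domain name
    \.\w+$        # top-level domain, e.g. '.com'
""", re.VERBOSE)

assert email_pattern.match("jane.smith@example.com")
assert not email_pattern.match("john.doe123example.com")   # missing '@'
assert not email_pattern.match("emily.wilson101@hotmail")  # missing top-level domain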
Output
The check discovered that out of 10 clients, 2 didn't quite fit the expected pattern for the "Email Address" column.
{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_match_regex",
    "kwargs": {
      "column": "Email Address",
      "regex": "^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2,
    "unexpected_percent": 20.0,
    "unexpected_percent_total": 20.0,
    "unexpected_percent_nonmissing": 20.0,
    "partial_unexpected_list": [
      "john.doe123example.com",
      "emily.wilson101@hotmail"
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
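Since the original goal was to drop these rows (the clean-up I previously did in Alteryx), one possible follow-up – sketched here under the assumption that the object returned by gx.read_csv still behaves like a pandas DataFrame – is to re-run the expectation with result_format="COMPLETE", which adds the offending row indices to the result, and then drop them:

# Re-run the check asking for the complete result so the row indices of the
# non-conforming values are included alongside the values themselves.
result = df.expect_column_values_to_match_regex(
    "Email Address", email_regex, result_format="COMPLETE"
)

bad_indices = result.result["unexpected_index_list"]
bad_values = result.result["unexpected_list"]
print(bad_indices, bad_values)

# Drop the offending rows and save a cleaned copy (file name is illustrative).
clean_df = df.drop(index=bad_indices)
clean_df.to_csv("/content/conforming_dataset.csv", index=False)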