Validating Email Address Format with Python's Great Expectations
This week, I had the task of gathering a client contact list, but I ran into some issues with certain email addresses.
Specifically, I noticed that some email addresses were missing the '@' symbol, while others were lacking the domain name.
To ensure the integrity of our client contact list, every email address needs to follow the expected format: username@domain.com. That way, when we use this data for its intended purpose – sending customers promotions, updates, or other important communications – we minimize the number of undeliverable emails.
I had already taken a first pass at the problem by using Alteryx to remove the rows that don't adhere to this format.
Today, I'm excited to show you how we can use Great Expectations on Google Colaboratory to identify erroneous emails.
I want to give a special shout-out to our Product Manager, Ikechi Griffith, for challenging me to dive deeper into Great Expectations.
I started my day bright and early, as usual, and created a sample dataset with some problematic entries:
| ID | Name | Age | Salary | Gender | Registration_Date | Email Address |
|---|---|---|---|---|---|---|
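The rows of the sample table didn't survive formatting here, so as a rough stand-in, this is a sketch of how a file like mine could be put together with pandas. The valid rows below are invented purely for illustration, and only the two malformed addresses are taken from the validation output shown further down (the real dataset had 10 rows):

import pandas as pd

# Illustrative stand-in for the sample dataset; only the two malformed
# email addresses are taken from the actual validation output below.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John Doe", "Jane Smith", "Emily Wilson"],
    "Age": [34, 29, 41],
    "Salary": [55000, 62000, 71000],
    "Gender": ["M", "F", "F"],
    "Registration_Date": ["2023-01-15", "2023-02-03", "2023-03-22"],
    "Email Address": [
        "john.doe123example.com",    # missing '@'
        "jane.smith@example.com",    # valid
        "emily.wilson101@hotmail",   # missing top-level domain
    ],
})
sample.to_csv("/content/non_conforming_dataset.csv", index=False)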
I have not used this library before, so I had to familiarize myself with the documentation and watch a couple of videos, which I will link below.
With that said, here are the steps I took to determine which emails in the dataset are structurally invalid.
Install Great Expectations
First, you need to install the Great Expectations library in your Google Colab environment.
!pip install great-expectations
Instantiate Data Context + Read CSV
Instantiating a data context is a crucial first step when working with Great Expectations, as it provides the foundation for managing and organizing your data validation and testing processes.
import great_expectations as gx
context = gx.get_context()
df = gx.read_csv("/content/non_conforming_dataset.csv")
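Before writing any expectations, it doesn't hurt to confirm that the CSV loaded and the column names came through as expected. This is an optional sanity check, assuming the object gx.read_csv returns behaves like a pandas DataFrame:

# Optional sanity check: the object returned by gx.read_csv supports the
# usual pandas-style inspection methods.
print(type(df))               # Great Expectations' pandas DataFrame wrapper
print(df.columns.tolist())    # should include 'Email Address'
df.head()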
Create Expectation
In Great Expectations, you can create expectations using the expect_* methods, where '*' represents the type of expectation you want to set. In this case, I used expect_column_values_to_match_regex, as it's an easy way to check that each email address matches the expected format.
email_regex = r'^[\w\.-]+@[\w\.-]+\.\w+$'
df.expect_column_values_to_match_regex('Email Address', email_regex)
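If you want to unpack the regex itself, here is the same pattern spelled out with re.VERBOSE comments, plus a couple of quick checks against addresses like the ones in the dataset (plain Python, independent of Great Expectations):

import re

# Same pattern as above, rewritten with re.VERBOSE for readability.
email_pattern = re.compile(r"""
    ^[\w\.-]+     # local part: letters, digits, underscores, dots, hyphens
    @             # literal '@' separator
    [\w\.-]+      # domain name
    \.\w+$        # top-level domain, e.g. '.com'
""", re.VERBOSE)

assert email_pattern.match("jane.smith@example.com")
assert not email_pattern.match("john.doe123example.com")   # missing '@'
assert not email_pattern.match("emily.wilson101@hotmail")  # missing top-level domain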
Output
The check discovered that out of 10 clients, 2 didn't quite fit the expected pattern for the "Email Address" column.
{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_match_regex",
    "kwargs": {
      "column": "Email Address",
      "regex": "^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2,
    "unexpected_percent": 20.0,
    "unexpected_percent_total": 20.0,
    "unexpected_percent_nonmissing": 20.0,
    "partial_unexpected_list": [
      "john.doe123example.com",
      "emily.wilson101@hotmail"
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
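Since the original goal was to drop these rows (the clean-up I previously did in Alteryx), one possible follow-up – sketched here under the assumption that the object returned by gx.read_csv still behaves like a pandas DataFrame – is to re-run the expectation with result_format="COMPLETE", which adds the offending row indices to the result, and then drop them:

# Re-run the check asking for the complete result so the row indices of the
# non-conforming values are included alongside the values themselves.
result = df.expect_column_values_to_match_regex(
    "Email Address", email_regex, result_format="COMPLETE"
)

bad_indices = result.result["unexpected_index_list"]
bad_values = result.result["unexpected_list"]
print(bad_indices, bad_values)

# Drop the offending rows and save a cleaned copy (file name is illustrative).
clean_df = df.drop(index=bad_indices)
clean_df.to_csv("/content/conforming_dataset.csv", index=False)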