Validating Email Address Format with Python's Great Expectations

 


Your email address is valid!



This week, I had the task of gathering a client contact list, but I ran into some issues with certain email addresses.

Specifically, I noticed that some email addresses were missing the '@' symbol, while others were lacking the domain name.

To ensure the integrity of our client contact list, it's crucial that all email addresses follow the expected format: username@domain.com. This way, when we use this data for its intended purpose – sending emails to customers regarding promotions, updates, or other important communications – we can minimize the number of undeliverable emails.

I've already taken steps to address this issue by removing rows that don't adhere to this format using Alteryx. 

Today, I'm excited to show you how we can use Great Expectations on Google Colaboratory to identify  erroneous emails.

I want to give a special shout-out to our Product Manager, Ikechi Griffith, for challenging me to dive deeper into Great Expectations.

I started my day bright and early, as usual, and created a sample dataset with some problematic entries:


IDNameAgeSalaryGenderRegistration_DateEmail Address
1Alice2590KUnknown1/15/2022john.doe123example.com
2Bob3975000Female12/25/2021sarah.smith456@gmail.com
3Charlie2375000UnknownInvalidmike.jones789@yahoo.com
4David3875000Male3/10/2022emily.wilson101@hotmail
5Eve4280kMale2/5/2023david.miller2022@outlook.com
6Frank4280kUnknown11/30/2020lisa.johnson22@aol.com
7Grace3460000Unknown9/20/2021robert.brown777@icloud.com
8Helen3990KUnknown6/15/2022lauren.white888@protonmail.com
9Ivy3675000Female4/2/2022chris.jackson55@yandex.com
10Jack4275000Female8/12/2021anna.martin999@gmx.com


I have not used this library before, so I had to familiarize myself with the documentation and watch a couple of videos, which I will link below.

With that being said, here are the steps taken to determine which emails in the dataset are structurally invalid. 

Install Great Expectations 

First, you need to install the Great Expectations library in your Google Colab environment.

!pip install great-expectations

Instantiate Data Context + Read CSV

Instantiating a data context in Great Expectations is a crucial step when working with the framework, as it provides the foundation for managing and organizing your data validation and testing processes. 

  
    import great_expectations as gx

    context = gx.get_context()
    df = gx.read_csv("/content/non_conforming_dataset.csv")
  
 

Create Expectation 

In Great Expectations, you can create expectations using the expect_* methods, where '*' represents the type of expectation you want to set. In this case, I used expect_column_values_to_match_regex as it's an easy way to identify email addresses in the correct format

    email_regex = r'^[\w\.-]+@[\w\.-]+\.\w+$'


    df.expect_column_values_to_match_regex('Email Address', email_regex)
  

Output  

The check discovered that out of 10 clients, 2 didn't quite fit the expected pattern for the "Email Address" column. 
  
{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_match_regex",
    "kwargs": {
      "column": "Email Address",
      "regex": "^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2,
    "unexpected_percent": 20.0,
    "unexpected_percent_total": 20.0,
    "unexpected_percent_nonmissing": 20.0,
    "partial_unexpected_list": [
      "john.doe123example.com",
      "emily.wilson101@hotmail"
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
  

Resources




Comments

Popular posts from this blog

Prompt Engineering : An Introduction

Women In STEM : Challenges and Advantages

5 Authentication Methods

Inductive and Deductive Reasoning

Don't Be Bland : Spice Up Your Personal Brand

3 Common Diseases Associated With Sitting All Day

Coding Best Practices : Error Messages Are Friends, Not Foes.

Upskilling: Certificates vs. Certifications

There Has Been a Data Breach: Now What?

Scheduling Algorithms