Data Management: Handling Dupes


Duplication refers to a type of mutation in which one or more copies of a DNA segment is produced
 
Today's post is inspired by a course I recently completed on DataCamp around Cleaning Data in SQL.

In the context of Data Quality, we are often interested in a dimension known as "Uniqueness", which can be thought of as having one copy of each record in a table. 

Now you may be asking yourself, "Why is duplication bad?" 

To that, I would say that it's expensive to store copies, and it also skews data analysis.

For example,  if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience.

Now, suppose we have a table named orders, to identify these duplicate records, we can use a common SQL technique with the ROW_NUMBER() window function. 


WITH CTE AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer_id, order_date, amount
                           ORDER BY order_id) AS row_num
    FROM orders
   )

SELECT *
FROM CTE
WHERE row_num > 1;

In this query, we use a Common Table Expression (CTE) to assign a unique row number to each record within each partition of customer_id, order_date, and amount. Any record with a row_num greater than 1 is considered a duplicate.

Pro Tips

Before deleting duplicates in a database consider the following to ensure that data integrity is maintained and no essential information is lost:

  • Clearly define the criteria for what constitutes a duplicate.
  • When duplicates are identified, decide on a strategy to determine which record to keep. 
  • Test the process on a subset of the data or in a staging environment.


Resources






Comments

Popular posts from this blog

Prompt Engineering : An Introduction

Women In STEM : Challenges and Advantages

5 Authentication Methods

Inductive and Deductive Reasoning

Don't Be Bland : Spice Up Your Personal Brand

3 Common Diseases Associated With Sitting All Day

Coding Best Practices : Error Messages Are Friends, Not Foes.

Upskilling: Certificates vs. Certifications

There Has Been a Data Breach: Now What?

Scheduling Algorithms