Data Management: Handling Dupes

- July 21, 2024

Duplication refers to a type of mutation in which one or more copies of a DNA segment is produced

Today's post is inspired by a course I recently completed on DataCamp around Cleaning Data in SQL.

In the context of Data Quality, we are often interested in a dimension known as "Uniqueness", which can be thought of as having one copy of each record in a table.

Now you may be asking yourself, "Why is duplication bad?"

To that, I would say that it's expensive to store copies, and it also skews data analysis.

For example, if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience.

Now, suppose we have a table named orders, to identify these duplicate records, we can use a common SQL technique with the ROW_NUMBER() window function.


WITH CTE AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer_id, order_date, amount
                           ORDER BY order_id) AS row_num
    FROM orders
   )

SELECT *
FROM CTE
WHERE row_num > 1;

In this query, we use a Common Table Expression (CTE) to assign a unique row number to each record within each partition of customer_id, order_date, and amount. Any record with a row_num greater than 1 is considered a duplicate.

Pro Tips

Before deleting duplicates in a database consider the following to ensure that data integrity is maintained and no essential information is lost:

Clearly define the criteria for what constitutes a duplicate.
When duplicates are identified, decide on a strategy to determine which record to keep.
Test the process on a subset of the data or in a staging environment.

Search This Blog

Tech Talk with Toni