Data Management: Handling Dupes
In genetics, duplication refers to a type of mutation in which one or more copies of a DNA segment are produced.
Today's post is inspired by a course I recently completed on DataCamp, Cleaning Data in SQL.
In the context of Data Quality, we are often interested in a dimension known as "Uniqueness", which can be thought of as having exactly one copy of each record in a table.
Now you may be asking yourself, "Why is duplication bad?"
To that, I would say that duplicate records are expensive to store, and they also skew data analysis.
For example, if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience.
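Before jumping into the fix, a quick way to gauge uniqueness is to compare the total number of rows against the number of distinct combinations of the columns that define a duplicate. The query below is a minimal sketch, assuming an orders table like the one used later in this post and that a duplicate means the same customer_id, order_date, and amount:

-- If distinct_rows is lower than total_rows, the table contains duplicates
SELECT
    (SELECT COUNT(*) FROM orders) AS total_rows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT customer_id, order_date, amount
           FROM orders) AS unique_orders) AS distinct_rows;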
Now, suppose we have a table named orders. To identify duplicate records, we can use a common SQL technique based on the ROW_NUMBER() window function.
WITH CTE AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        amount,
        -- Number each record within groups sharing the same
        -- customer_id, order_date, and amount, ordered by order_id
        ROW_NUMBER() OVER (PARTITION BY customer_id, order_date, amount
                           ORDER BY order_id) AS row_num
    FROM orders
)
-- Every record beyond the first in each group is a duplicate
SELECT *
FROM CTE
WHERE row_num > 1;
In this query, the ROW_NUMBER() window function assigns a sequential number to each record within each partition of customer_id, order_date, and amount, and the Common Table Expression (CTE) lets us filter on that number. Any record with a row_num greater than 1 is considered a duplicate.
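If the goal is to remove these duplicates rather than just find them, the same CTE pattern can drive a delete. The following is a minimal sketch, assuming order_id is the table's primary key and that keeping the lowest order_id in each group is the desired strategy; the exact syntax for combining a CTE with DELETE varies by database (this version follows PostgreSQL):

WITH ranked AS (
    SELECT
        order_id,
        ROW_NUMBER() OVER (PARTITION BY customer_id, order_date, amount
                           ORDER BY order_id) AS row_num
    FROM orders
)
-- Remove every record beyond the first in each group
DELETE FROM orders
WHERE order_id IN (SELECT order_id FROM ranked WHERE row_num > 1);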
Pro Tips
Before deleting duplicates in a database, consider the following to ensure that data integrity is maintained and no essential information is lost:
- Clearly define the criteria for what constitutes a duplicate.
- When duplicates are identified, decide on a strategy to determine which record to keep.
- Test the process on a subset of the data or in a staging environment.
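Once the cleanup has run, whether in staging or production, it is worth confirming that no duplicates remain. A simple verification, again assuming duplicates are defined by customer_id, order_date, and amount, is to group on those columns and look for any group with more than one row:

-- Should return zero rows if the deduplication worked as intended
SELECT
    customer_id,
    order_date,
    amount,
    COUNT(*) AS copies
FROM orders
GROUP BY customer_id, order_date, amount
HAVING COUNT(*) > 1;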