Tech Talk with Toni

Data Management: Handling Dupes

- July 21, 2024

Duplication refers to a type of mutation in which one or more copies of a DNA segment are produced.

Today's post is inspired by a course I recently completed on DataCamp, Cleaning Data in SQL. In the context of data quality, we are often interested in a dimension known as "Uniqueness", which means that each record appears only once in a table. Now you may be asking yourself, "Why is duplication bad?" To that, I would say that storing copies is expensive, and duplicates also skew data analysis, since repeated rows inflate counts and totals. They cause operational problems too: if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience.

Now, suppose we have a table named orders. To identify its duplicate records, we can use a common SQL technique built on the ROW_NUMBER() window function.

WITH CTE AS ( SELECT order_id, customer_id, order_date, amount, ROW_NUMBER() OVER (PAR...
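To make the ROW_NUMBER() idea concrete, here is a minimal sketch you can run locally with Python's built-in sqlite3 module. It is not the exact query from the post: the sample rows are made up, and the choice to treat rows as duplicates when customer_id, order_date, and amount all match is my assumption for illustration. It also assumes your SQLite build is 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical sample data: these rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-07-01", 25.0),
        (2, 100, "2024-07-01", 25.0),  # same customer, date, and amount as order 1
        (3, 101, "2024-07-02", 40.0),
        (4, 102, "2024-07-03", 15.5),
        (5, 102, "2024-07-03", 15.5),  # same customer, date, and amount as order 4
    ],
)

# Assumption: two rows are duplicates when customer_id, order_date, and amount match.
# ROW_NUMBER() restarts at 1 inside each partition, so any row numbered above 1
# is an extra copy of the first row in its group.
DUPLICATES_SQL = """
WITH numbered AS (
    SELECT order_id, customer_id, order_date, amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, order_date, amount
               ORDER BY order_id
           ) AS row_num
    FROM orders
)
SELECT order_id, customer_id, order_date, amount
FROM numbered
WHERE row_num > 1
ORDER BY order_id
"""

duplicates = conn.execute(DUPLICATES_SQL).fetchall()
for row in duplicates:
    print(row)
```

Ordering by order_id inside each partition keeps the lowest order_id as the "original" and flags the later copies. Which columns define a duplicate depends on your data, so adjust the PARTITION BY list to match your own definition of uniqueness.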