Data Management: Handling Dupes

Duplication refers to a type of mutation in which one or more copies of a DNA segment are produced.

Today's post is inspired by a course I recently completed on DataCamp, Cleaning Data in SQL. In the context of Data Quality, we are often interested in a dimension known as "Uniqueness", which can be thought of as having exactly one copy of each record in a table.

Now you may be asking yourself, "Why is duplication bad?" To that, I would say that it's expensive to store copies, and it also skews data analysis. For example, if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience.

Now, suppose we have a table named orders. To identify its duplicate records, we can use a common SQL technique with the ROW_NUMBER() window function.

WITH CTE AS ( SELECT order_id, customer_id, order_date, amount, ROW_NUMBER() OVER (PAR...
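
To make the ROW_NUMBER() idea concrete, here is a minimal sketch you can run locally with Python's built-in sqlite3 module. It makes a few assumptions of its own: two orders count as duplicates when customer_id, order_date, and amount all match (that partition is my guess for illustration, not necessarily the one used in the query above), the sample rows are made up, and your SQLite build is 3.25 or newer so that window functions are available. The CTE name ranked and the alias row_num are also just illustrative.

```python
import sqlite3

# Hypothetical sample data: column names come from the post,
# the rows themselves are invented for illustration.
rows = [
    (1, 101, "2024-01-05", 50.0),
    (2, 101, "2024-01-05", 50.0),  # same customer, date, amount as order 1
    (3, 102, "2024-01-06", 75.0),
    (4, 103, "2024-01-07", 20.0),
    (5, 103, "2024-01-07", 20.0),  # same as order 4
    (6, 103, "2024-01-07", 20.0),  # same as order 4
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# Number the rows inside each group of matching values.
# The first row in a group gets 1; any row numbered 2 or higher is an extra copy.
# ASSUMPTION: the PARTITION BY columns define what "duplicate" means here.
query = """
WITH ranked AS (
    SELECT order_id, customer_id, order_date, amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, order_date, amount
               ORDER BY order_id
           ) AS row_num
    FROM orders
)
SELECT order_id
FROM ranked
WHERE row_num > 1
ORDER BY order_id
"""

duplicate_ids = [row[0] for row in conn.execute(query)]
print(duplicate_ids)  # -> [2, 5, 6]
```

The choice of PARTITION BY columns is the real design decision: partition on too few columns and legitimate repeat orders get flagged, partition on too many and near-identical copies slip through. The ORDER BY inside the window decides which copy is treated as the original, here the one with the lowest order_id.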