Posts

Showing posts with the label Data Cleaning

Data Management: Handling Dupes

Duplication refers to a type of mutation in which one or more copies of a DNA segment are produced.

Today's post is inspired by a course I recently completed on DataCamp, Cleaning Data in SQL. In the context of Data Quality, we are often interested in a dimension known as "Uniqueness", which can be thought of as having exactly one copy of each record in a table. Now you may be asking yourself, "Why is duplication bad?" To that, I would say that it's expensive to store copies, and duplicates also skew data analysis. For example, if duplicate records exist, a customer might receive multiple marketing emails for the same promotion, leading to frustration and a poor customer experience. Now, suppose we have a table named orders. To identify these duplicate records, we can use a common SQL technique with the ROW_NUMBER() window function.

WITH CTE AS ( SELECT order_id, customer_id, order_date, amount, ROW_NUMBER() OVER (PAR...
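
The query in the excerpt is cut off before its partition clause, so what follows is only a minimal sketch of the ROW_NUMBER() technique, not the post's actual query. It assumes that two orders count as duplicates when they share customer_id, order_date, and amount; that rule, the CTE name numbered, the row_num alias, and the sample rows are all hypothetical. It runs the SQL through Python's built-in sqlite3 module, which needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# In-memory database with a hypothetical orders table; the column names
# come from the excerpt, the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL)"
)
rows = [
    (1, 101, "2024-01-05", 50.0),
    (2, 101, "2024-01-05", 50.0),  # repeats order 1
    (3, 102, "2024-01-06", 20.0),
    (4, 103, "2024-01-07", 75.0),
    (5, 103, "2024-01-07", 75.0),  # repeats order 4
    (6, 103, "2024-01-07", 75.0),  # repeats order 4 again
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# ASSUMPTION: rows sharing customer_id, order_date and amount are duplicates.
# ROW_NUMBER() restarts at 1 inside each such group, so any row numbered
# above 1 is an extra copy of the group's first row.
FIND_DUPLICATES = """
WITH numbered AS (
    SELECT order_id, customer_id, order_date, amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, order_date, amount
               ORDER BY order_id
           ) AS row_num
    FROM orders
)
SELECT order_id, customer_id, order_date, amount, row_num
FROM numbered
WHERE row_num > 1
ORDER BY order_id
"""

duplicates = conn.execute(FIND_DUPLICATES).fetchall()
for row in duplicates:
    print(row)
```

With this sample data, orders 2, 5, and 6 are flagged as extra copies, while orders 1, 3, and 4 are kept as the first row of their groups. Which columns belong in the PARTITION BY depends on what "the same record" means for the table in question.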

Data Gone Bad: How to Minimize Its Risks

A skull and crossbones is never a good sign, unless you're a pirate.

Today's post is inspired by a video I saw about a customer (let's call her Sally) who purchased a beef pie that had gone bad from a bakery. Sally took a bite and, to her dismay, discovered creepy crawlies inside. She was understandably upset, but I think this is a great metaphor for what it is like to use outdated data. In both cases, you consume something (be it the pie or the dataset) expecting good results, only to find out later that there's a problem. Now, just one bad experience could make Sally stop buying from that bakery and lose trust in all store-bought meat pastries. Similarly, you might stop using a dataset and start relying solely on your gut instinct for decisions. Wouldn't it be great if pastries could indicate when they're nearing their expiration date, allowing bakery owners to sell them faster or remove them from the shelves altogether? While that might be ideal, mistakes happ...