Day 5 of using @skrub_data
Back with more skrub nuggets :). The more I read the docs, the more useful things I find. Data cleaning is not the fun part, but it is a core part of any data science pipeline. skrub makes it a lot easier.
In this post we'll see how to deduplicate categorical data with skrub.
Real-world category columns often contain the same value written in slightly different ways.
For example, company names might show up with small typos:
Amazon, Amazn, Amaozn, Aamazon
skrub has a built-in deduplicate() function for this. It compares strings by their character n-grams, clusters close variants together, and maps every variant in a cluster to that cluster's most frequent spelling.
This helps when categorical data has spelling errors, duplicate labels, or manual entries that should be treated as one category. It is especially handy when you know the correct values, or when string similarity does not matter for the task.