Remove duplicates in tabular data having high cardinality and degree | Optimized Approach
If you work in IT or a related business, then at least once you will come across a situation where you have to deal with tabular data that contains redundancy you need to remove.
The approach below is based on SQL and can be carried over analogously to Excel and other traditional (or upcoming) RDBMSs.
The cynosure -> How to remove duplicates?
Let’s assume there are 4 columns – A, B, C, D – in a table TABLE1 and you need to find the redundancy on the basis of B and C.
So we can approach it by either grouping or partitioning our data on the basis of B and C, i.e.
- Using GROUP BY
SELECT B, C, COUNT(*) AS CNT FROM TABLE1 GROUP BY B, C HAVING COUNT(*) > 1;
- Using PARTITION BY (the row number has to be computed in a derived table before it can be filtered; a delete variant is sketched after this list)
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY B, C ORDER BY B) AS RN FROM TABLE1) T WHERE RN > 1;
- There are many other solutions like Common Table Expressions, SSIS packages, etc., but those are not generalized approaches.
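If the goal is to actually remove the extra rows rather than just list them, one common T-SQL pattern is to delete through a CTE built on the same ROW_NUMBER() logic. This is a minimal sketch assuming the TABLE1, B, and C names from above; the ORDER BY only decides which copy survives:

WITH Dupes AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY B, C ORDER BY B) AS RN
    FROM TABLE1
)
DELETE FROM Dupes WHERE RN > 1;  -- keeps one row per (B, C) pair, deletes the rest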
Now, think of a use case where you have huge cardinality (say 100 million rows) and degree (say 1,500 columns), and you need to consider at least 100 columns for the redundancy check.
Grouping or partitioning on such a huge number of columns is not an optimized approach, as these methods work well only up to roughly 5-7 columns (depending on the cardinality).
If the cardinality is huge, the query might keep executing for hours!
Think of an approach where you can convert all of those columns into a single column.
Yes, you’re right – HASHING.
SQL Server has a built-in function called HashBytes to support data hashing.
HASHBYTES ( 'ALGO', { @input | 'input' } )
ALGO = MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512
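As a quick, hypothetical one-liner (the literal string is just an illustration), hashing a value looks like this:

SELECT HASHBYTES('SHA1', 'Hello World') AS HashedValue;  -- returns a 20-byte VARBINARY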
Note: for hashing, the sequence of the inputs matters!
A hash function produces the same output for the same input, and that output has a fixed length no matter how long the input is (so your long paragraphs are also covered under the redundancy check).
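To make both properties concrete, here is a small sketch (assuming SQL Server 2017+ for CONCAT_WS; the literal values are placeholders):

-- Same values in the same order always hash to the same 20-byte result.
SELECT HASHBYTES('SHA1', CONCAT_WS('_', 'foo', 'bar')) AS Hash1;
-- Swapping the order changes the concatenated string, so the hash no longer matches.
SELECT HASHBYTES('SHA1', CONCAT_WS('_', 'bar', 'foo')) AS Hash2;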
If you design your DDL query like this:
CREATE TABLE db_name.TABLE1 (
    A VARCHAR(1000),
    B VARCHAR(1000),
    C VARCHAR(1000),
    D VARCHAR(1000),
    HashValue AS HASHBYTES('SHA1', CONCAT_WS('_', A, B, C, D)) PERSISTED
);
Using PERSISTED is helpful here – it tells the database engine to physically store the computed values in the table and to update them whenever any of the columns the computed column depends on are updated. Moreover, if you create an index on the column, it becomes even more efficient.
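A minimal index sketch (the index name is my own; SQL Server may warn about the maximum key length because HASHBYTES is typed as VARBINARY(8000), but a SHA1 value is always 20 bytes, so inserts will not fail):

CREATE NONCLUSTERED INDEX IX_TABLE1_HashValue
    ON db_name.TABLE1 (HashValue);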
Now, grouping (or partitioning) on this single column will be much faster than applying it on a hundred or more columns.
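Putting it together, the earlier duplicate check now runs over one narrow column (a sketch reusing the names defined above):

SELECT HashValue, COUNT(*) AS CNT
FROM db_name.TABLE1
GROUP BY HashValue
HAVING COUNT(*) > 1;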
Would love to hear from you on LinkedIn. User: chanpreet-singh-ai