I have a large table (2M+ records, with many columns) and I intend to deduplicate it using GROUP BY. Which of the following two strategies would perform better?
- GROUP BY multiple columns (col_a, col_b, col_c).
- Add a new column dedup_col containing a normalized string built from col_a, col_b and col_c, then GROUP BY dedup_col. The dedup_col column would be populated beforehand.
I know I can run benchmarks, but I would like some theoretical input before I start on the implementation.
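
To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `records`, the sample rows, and the way dedup_col is built (the three values joined with a unit-separator character) are assumptions for illustration only; the real normalization and database engine may differ.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real 2M+ row table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, col_a TEXT, col_b TEXT, col_c TEXT, dedup_col TEXT)"
)

SEP = "\x1f"  # assumed separator, so ('a', 'bc', '') and ('a', 'b', 'c') get different keys

sample_rows = [
    ("a", "b", "c"),
    ("a", "b", "c"),   # duplicate of the first row
    ("a", "bc", ""),   # would collide with the first row if joined without a separator
    ("x", "y", "z"),
]

# dedup_col is populated beforehand, as described in strategy 2.
conn.executemany(
    "INSERT INTO records (col_a, col_b, col_c, dedup_col) VALUES (?, ?, ?, ?)",
    [(a, b, c, SEP.join((a, b, c))) for a, b, c in sample_rows],
)

# Strategy 1: GROUP BY the three columns directly.
by_columns = conn.execute(
    "SELECT col_a, col_b, col_c, COUNT(*) FROM records "
    "GROUP BY col_a, col_b, col_c"
).fetchall()

# Strategy 2: GROUP BY the precomputed key.
by_key = conn.execute(
    "SELECT dedup_col, COUNT(*) FROM records GROUP BY dedup_col"
).fetchall()

print(len(by_columns), len(by_key))
```

Both queries should produce the same groups as long as the key construction is unambiguous, so the question is purely about which form is cheaper to execute at scale.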