0

I have a large table (2M+ records with loads of columns). I intend to do a GROUP BY for deduplication purposes. Which of the following two strategies would perform better?

  1. GROUP BY multiple columns(col_a, col_b, col_c)
  2. ADD a new column dedup_col consisting of a normalized string formed using col_a,col_b,col_c and then do a GROUP BY on dedup_col. The dedup_col will be populated beforehand.

I know I can run benchmarks but I would like some theoretical input before I start implementation.

3 Answers

6

For the love of God, go with option 1. Don't resort to #2 unless you have serious performance problems with #1 and you have exhausted all other options (including indexing) to solve them.

Option #2 is a terrible idea. Effectively you are reinventing the wheel by implementing a poor man's version of an index...badly.

Never, Ever, Ever, de-normalize (that's what you are doing in option 2) your data for performance until you have identified a performance problem. Even then, you probably shouldn't do it.

FYI: 2 Million records is NOT a big database if you have your indexes set up correctly.
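
As a rough sketch of what "indexing" could mean for option 1, a composite index over the three grouping columns might look like the following. The index name is hypothetical, `yourtable` stands in for the real table, and whether the planner actually uses the index for a full-table GROUP BY depends on the data and statistics, so treat this as something to test rather than a guaranteed speedup.

```sql
-- Hypothetical index name; yourtable and the column names are placeholders.
CREATE INDEX yourtable_dedup_idx ON yourtable (col_a, col_b, col_c);

-- Option 1: list each duplicate group and how many rows it contains.
SELECT col_a, col_b, col_c, COUNT(*) AS copies
FROM yourtable
GROUP BY col_a, col_b, col_c
HAVING COUNT(*) > 1;
```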

4

I'd run EXPLAIN (or EXPLAIN ANALYZE) on the various queries to compare costs. That'll be worth more than any theoretical answer you get here. Let PostgreSQL tell you what it'll do.
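
In PostgreSQL the command is `EXPLAIN`, optionally with `ANALYZE` to execute the query and report actual timings. A minimal comparison of the two strategies from the question could look like this, assuming a table named `yourtable` (a placeholder) and that `dedup_col` has already been populated:

```sql
-- Option 1: group on the three original columns.
EXPLAIN (ANALYZE, BUFFERS)
SELECT col_a, col_b, col_c, COUNT(*)
FROM yourtable
GROUP BY col_a, col_b, col_c;

-- Option 2: group on the precomputed dedup_col.
EXPLAIN (ANALYZE, BUFFERS)
SELECT dedup_col, COUNT(*)
FROM yourtable
GROUP BY dedup_col;
```

Note that `ANALYZE` really runs the statement, so when explaining a `DELETE` it is safer to wrap it in `BEGIN; ... ROLLBACK;`.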

0

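The method I usually end up using for this is to use the ctid system column (the row's physical location in the table). For example:

-- Keep the row with the highest ctid in each (col_a, col_b, col_c) group
-- and delete every other row.
DELETE FROM yourtable
WHERE ctid NOT IN (
    SELECT MAX(dt.ctid)
    FROM yourtable AS dt
    GROUP BY dt.col_a, dt.col_b, dt.col_c
);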

But there are many other options, and a lot depends on the table, the number of indexes, and so on. Deletes can be expensive, though: I've also had instances where it was better to create a new table from a select of the unique rows, then drop the original table and rename the new one to the original name.
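
The create-and-swap approach could be sketched roughly as follows. `yourtable_dedup` is a hypothetical name, and `DISTINCT ON` is just one way to pick a single row per group (which row survives is arbitrary here, since the ORDER BY only covers the grouping columns). Be aware that `CREATE TABLE ... AS` does not copy indexes, constraints, defaults, or privileges, so those would need to be recreated, and `DROP TABLE` will fail if other objects such as views or foreign keys depend on the table.

```sql
BEGIN;

-- yourtable_dedup is a hypothetical name for the replacement table.
CREATE TABLE yourtable_dedup AS
SELECT DISTINCT ON (col_a, col_b, col_c) *
FROM yourtable
ORDER BY col_a, col_b, col_c;

-- Swap the deduplicated copy in under the original name.
DROP TABLE yourtable;
ALTER TABLE yourtable_dedup RENAME TO yourtable;

COMMIT;
```

Because DDL is transactional in PostgreSQL, the swap either happens completely or not at all.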
