Performance of GROUP BY in Postgres

Question

I have a large table (2M+ records with loads of columns). I intend to do a GROUP BY for deduplication purposes. Which of the following two strategies would perform better?

1. GROUP BY multiple columns (col_a, col_b, col_c).
2. Add a new column dedup_col consisting of a normalized string formed from col_a, col_b, col_c, and then do a GROUP BY on dedup_col. The dedup_col will be populated beforehand.

I know I can run benchmarks but I would like some theoretical input before I start implementation.

JohnFx · Accepted Answer · 2012-02-22 04:51:35Z

6

For the love of God, go with option 1. Don't resort to #2 unless you have serious performance problems with #1 and you have exhausted all other options (including indexing) to solve them.

Option #2 is a terrible idea. Effectively you are reinventing the wheel by implementing a poor man's version of an index...badly.

Never, ever, ever denormalize your data for performance (that's what you are doing in option 2) until you have identified a performance problem. Even then, you probably shouldn't do it.

FYI: 2 Million records is NOT a big database if you have your indexes set up correctly.
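As a rough sketch of what option 1 with indexing could look like: the table name yourtable is borrowed from another answer in this thread and the index name is hypothetical. Whether the planner actually uses the index for the grouping depends on the data and the PostgreSQL version, so treat this as a starting point rather than a guaranteed win.

```sql
-- Hypothetical index name; column names follow the question.
CREATE INDEX yourtable_dedup_idx ON yourtable (col_a, col_b, col_c);

-- List each duplicated combination and how many rows share it.
SELECT col_a, col_b, col_c, count(*) AS copies
FROM yourtable
GROUP BY col_a, col_b, col_c
HAVING count(*) > 1;
```

Grouping on the real columns also avoids having to keep a derived dedup_col in sync with its source columns.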

duffymo · Accepted Answer · 2012-02-22 03:12:59Z

4

I'd run EXPLAIN (or EXPLAIN ANALYZE) on the various queries to compare costs. That'll be worth more than any theoretical answer you get here. Let PostgreSQL tell you what it'll do.
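A minimal sketch of that comparison, assuming the table is called yourtable (a placeholder) and that dedup_col from option 2 has already been populated. Plain EXPLAIN only shows the estimated plan; EXPLAIN with the ANALYZE option actually executes the query and reports real timings.

```sql
-- Option 1: estimated plan for grouping on the three columns.
EXPLAIN
SELECT col_a, col_b, col_c, count(*)
FROM yourtable
GROUP BY col_a, col_b, col_c;

-- Option 2: run the query for real and report timing and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT dedup_col, count(*)
FROM yourtable
GROUP BY dedup_col;
```

For a fair comparison, run both queries the same way (both with ANALYZE, ideally more than once so caching affects them equally).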

Jer In Chicago · Accepted Answer · 2012-02-22 03:29:31Z

0

The method I usually end up using for this relies on the ctid system column. For example:

DELETE FROM yourtable
WHERE ctid NOT IN (
    SELECT MAX(dt.ctid)
    FROM yourtable AS dt
    GROUP BY dt.col_a, dt.col_b, dt.col_c
);

But there are many other options, and a lot depends on the table, the number of indexes, and so on. Deletes can be expensive, though: I've also had instances where it was better to create a new table from a select of the unique rows, then drop the original table and rename the new one to the original name.
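The copy-and-rename approach could look roughly like the sketch below. The name yourtable_dedup is hypothetical, and DISTINCT ON is just one way to pick a single row per group; without extra ORDER BY columns, which duplicate survives is arbitrary. Note that CREATE TABLE AS does not copy indexes, constraints, defaults, or privileges, so those would need to be recreated, and the DROP will fail if other objects depend on the table.

```sql
BEGIN;

-- Keep one row per (col_a, col_b, col_c) combination.
CREATE TABLE yourtable_dedup AS
SELECT DISTINCT ON (col_a, col_b, col_c) *
FROM yourtable
ORDER BY col_a, col_b, col_c;

-- Swap the deduplicated copy in under the original name.
DROP TABLE yourtable;
ALTER TABLE yourtable_dedup RENAME TO yourtable;

COMMIT;
```

Because DDL is transactional in PostgreSQL, wrapping the swap in a single transaction means other sessions never see the table missing.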

1 Comment

Jer In Chicago Over a year ago

Also, check out: postgresonline.com/journal/archives/…
