Postgres table : Find duplicates in two columns, regardless of order

Question

I have a PostgreSQL table describing lines between two points. It contains two columns, A and B, integers representing the id of a point (described in another table).

But each line is duplicated in the table, as the line going from A to B is the same as the line going from B to A.

I'd like to remove the duplicates, but I can't find an aggregate function that works on two columns, so that I could group the AB and BA lines together and then remove one of them.

Thanks :)

I can envisage at least two kinds of duplication in this case. 1) The Lines table contains a record that points to point_id(1), point_id(2) and also a record that points to point_id(2), point_id(1). 2) The two lines have different point_id values, but when you look in the point table, different point_ids can have the same coordinates. Could you give examples to clarify?
– MatBailie, Commented Jun 12, 2012 at 12:57
Thanks for your comment. The duplicates are in point_ids, not in coordinates, so it's the first case in your question. Moreover, all the lines are duplicated: for each AB line there is a BA line, as a result of the table creation algorithm.
– Laurent Jégou, Commented Jun 12, 2012 at 13:12

Andomar · Accepted Answer · 2012-06-12 12:56:38Z

8

Identifying the duplicates:

select least(a,b), greatest(a,b), count(*)
from the_table
group by least(a,b), greatest(a,b)
having count(*) > 1

I think you should be able to delete one of the pairs using:

delete from the_table
where (least(a,b), greatest(a,b)) in (
                select least(a,b), greatest(a,b)
                from the_table
                group by least(a,b), greatest(a,b)
                having count(*) > 1);

(Not tested!)

edited Jun 12, 2012 at 12:56

Andomar

answered Jun 12, 2012 at 12:55

user330315

4 Comments

Andomar Over a year ago

Hmmm... it now seems to me this will delete more than just the duplicate records

MatBailie Over a year ago

@LaurentJégou - This will delete every record for any line that has duplicates; if a line exists as a,b AND b,a, both records will be deleted. The outer condition only needs to be WHERE (a,b) IN (...); then it will only delete the instances of the line where a<b. This then assumes that any line with a duplicate exists as both a,b and b,a, and also assumes that no line will have multiple a,b entries (so that deleting all of the b,a entries will be sufficient). In which case, it becomes functionally very similar to my answer, but with a little extra complexity ;)

Laurent Jégou Over a year ago

I agree with Dems' comment. I used the "where (a, b) in" version, and it deleted only the duplicates.

MatBailie Over a year ago

@LaurentJégou - In which case I think this is slightly over complex as it exhibits the same behaviour (and assumptions) as my simpler answer. I would expect, though I have not tested, that this answer would also be slower (more cpu, more reads) than the simpler answer.
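
The correction discussed in the comments above can be sketched as follows. This is only an illustration, not a tested statement: it reuses the names the_table, a and b from the answer, and it relies on the same assumption stated in the comments, namely that each duplicated line exists exactly once as a,b and once as b,a.

```sql
-- Sketch only: the_table, a and b are the names used in the answer above.
-- The subquery lists every unordered pair that occurs more than once,
-- always written as (smaller id, larger id).
-- Matching on (a, b) instead of (least(a,b), greatest(a,b)) removes only
-- the row stored in that orientation and leaves the reversed row in place.
delete from the_table
where (a, b) in (
                select least(a,b), greatest(a,b)
                from the_table
                group by least(a,b), greatest(a,b)
                having count(*) > 1);
```

Running the "Identifying the duplicates" query first, and wrapping the delete in a transaction that can be rolled back, is a cheap way to check the row counts before committing. Note that a row with a = b that occurs several times would lose all of its copies under this statement.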

MatBailie · Answer · 2012-06-12 13:00:41Z

2

I've left a comment, but I'm going to assume for now that the only difference between two duplicate records is that they have the same point_id values, but in reverse order.

In which case, it is actually quite simple to do...

-- Delete the b,a copy of every line that also exists as a,b
DELETE FROM
  line
WHERE
  point_id_a > point_id_b
  AND EXISTS (SELECT *
                FROM line AS lookup
               WHERE lookup.point_id_a = line.point_id_b
                 AND lookup.point_id_b = line.point_id_a
             );

answered Jun 12, 2012 at 13:00

MatBailie

2 Comments

Andomar Over a year ago

+1 This works assuming that the only duplicates have a,b swapped. It won't work if there are multiple rows with the same a,b

MatBailie Over a year ago

@Andomar - Correct, that's why I stated such an assumption :) But, interestingly, the accepted answer seems to be both incorrect (it deletes all occurrences, not just the duplicates) and, even when corrected, will make virtually the same assumption as my answer. (See my comment on the answer.)
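
For the case raised in the comments above, where the same a,b row may also appear more than once, one possible approach is to keep a single physical row per unordered pair. The sketch below is an assumption, not something proposed in the thread: it uses PostgreSQL's ctid system column as a row identifier, together with the table and column names from this answer (line, point_id_a, point_id_b).

```sql
-- Sketch only, PostgreSQL-specific: ctid is the physical row identifier.
-- Two rows describe the same line when their (least, greatest) pairs match.
-- Every row that has a matching row with a smaller ctid is deleted, so
-- exactly one row per unordered pair survives.
DELETE FROM line AS t
USING line AS d
WHERE least(t.point_id_a, t.point_id_b) = least(d.point_id_a, d.point_id_b)
  AND greatest(t.point_id_a, t.point_id_b) = greatest(d.point_id_a, d.point_id_b)
  AND t.ctid > d.ctid;
```

Which orientation survives (a,b or b,a) depends on physical row order, so this only fits if the direction of the remaining row does not matter. The self-join can be slow on large tables, and ctid values are not stable across updates or VACUUM FULL, so it should be used within a single statement like this rather than stored.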
