3

I have a PostgreSQL table describing lines between two points. It contains two columns, A and B, which are integers holding the id of a point (the points are described in another table).

But each line is duplicated in the table, since the line going from A to B is the same as the line going from B to A.

I'd like to remove the duplicates, but I can't find an aggregate function that works on two columns, so that I could group the AB and BA lines together and then remove one of them.

Thanks :)

2
  • I can envisage at least two ways the duplication could arise in this case: 1) the lines table contains a record that points to point_id(1), point_id(2) and also a record that points to point_id(2), point_id(1); 2) the two lines have different point_id values, but when you look in the point table, different point_ids can have the same coordinates. Could you give examples to clarify? Commented Jun 12, 2012 at 12:57
  • Thanks for your comment. The duplicates are in the point_ids, not in the coordinates, so it's the first case in your question. Moreover, all the lines are duplicated: for each AB line there is a BA line, as a result of the table creation algorithm. Commented Jun 12, 2012 at 13:12

2 Answers

8

Identifying the duplicates:

select least(a,b), greatest(a,b), count(*)
from the_table
group by least(a,b), greatest(a,b)
having count(*) > 1
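
To make the grouping concrete, here is a small sketch on made-up data. The temporary table and its rows are assumptions for illustration only; the column names a and b follow the query above. least() and greatest() normalise each pair so that (1,2) and (2,1) fall into the same group.

```sql
-- Hypothetical scratch data: two lines stored in both directions, one stored once.
create temp table the_table (a integer, b integer);

insert into the_table (a, b) values
    (1, 2), (2, 1),   -- same line, both directions
    (3, 4), (4, 3),   -- same line, both directions
    (5, 6);           -- stored once, not a duplicate

-- Same query as above, with an ORDER BY so the output order is stable.
select least(a,b), greatest(a,b), count(*)
from the_table
group by least(a,b), greatest(a,b)
having count(*) > 1
order by 1, 2;

-- Rows returned:
--   1 | 2 | 2
--   3 | 4 | 2
```

The (5,6) line does not appear because its group contains a single row.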

I think you should be able to delete one of the pairs using:

delete from the_table
where (least(a,b), greatest(a,b)) in (
                select least(a,b), greatest(a,b)
                from the_table
                group by least(a,b), greatest(a,b)
                having count(*) > 1);

(Not tested!)
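
As the comments on this answer point out, the statement above matches both rows of every duplicated pair, so it removes the line entirely. A sketch of the variant suggested there compares (a, b) directly, so only the row stored with the smaller id first is matched. It assumes every duplicated line exists exactly once as a,b and once as b,a.

```sql
-- Variant from the comments: match on (a, b) itself rather than on
-- (least(a,b), greatest(a,b)), so the b,a row of each pair is kept.
-- Assumption: no line is stored more than once in the same direction.
delete from the_table
where (a, b) in (
                select least(a,b), greatest(a,b)
                from the_table
                group by least(a,b), greatest(a,b)
                having count(*) > 1);
```

Running the subquery on its own first, or wrapping the delete in a transaction and rolling back, is a cheap way to check what would be removed.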

4 Comments

Hmmm... it now seems to me this will delete more than just the duplicate records
@LaurentJégou - This will delete every record of any line that has duplicates: if a line exists as both a,b AND b,a, both records will be deleted. The outer condition only needs to be WHERE (a,b) IN ( and then only the instances of the line where a<b are deleted. This assumes that any line with a duplicate exists as both a,b and b,a, and that no line has multiple a,b entries (so that deleting one direction of each pair is sufficient). In which case, it becomes functionally very similar to my answer, but with a little extra complexity ;)
I agree with Dems' comment. I used the "where (a, b) in" version, and it deleted only the duplicates.
@LaurentJégou - In which case I think this is slightly overcomplicated, as it exhibits the same behaviour (and makes the same assumptions) as my simpler answer. I would expect, though I have not tested it, that this answer would also be slower (more CPU, more reads) than the simpler answer.
2

I've left a comment, but I'm going to assume for now that the only difference between two duplicate records is that they have the same point_id values, but in reverse order.

In which case, it is actually quite simple to do...

DELETE FROM
  line
WHERE
  point_id_a > point_id_b
  AND EXISTS (SELECT *
                FROM line AS lookup
               WHERE lookup.point_id_a = line.point_id_b
                 AND lookup.point_id_b = line.point_id_a
             )
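
If the assumption above does not hold and the table may also contain exact repeats of the same (point_id_a, point_id_b) row, the statement above will leave some of them behind. One possible sketch for that case, which is not part of the original answer and is untested against the asker's data, numbers the rows within each normalised pair and deletes all but the first. It relies on PostgreSQL's ctid system column to tell otherwise identical rows apart.

```sql
-- Sketch only: keeps one arbitrary row per undirected line and deletes the rest.
-- ctid is PostgreSQL-specific; the table and column names follow the answer above.
DELETE FROM line
WHERE ctid IN (
    SELECT ctid
      FROM (SELECT ctid,
                   row_number() OVER (
                       PARTITION BY least(point_id_a, point_id_b),
                                    greatest(point_id_a, point_id_b)
                       ORDER BY ctid
                   ) AS rn
              FROM line
           ) AS ranked
     WHERE rn > 1   -- every row after the first in its group
);
```

Which direction survives for a given line depends on physical row order, so this only suits cases where a,b and b,a are truly interchangeable.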

2 Comments

+1 This works assuming that the only duplicates have a,b swapped. It won't work if there are multiple rows with the same a,b
@Andomar - Correct, that's why I stated that assumption :) But, interestingly, the accepted answer seems to be incorrect (it deletes all occurrences, not just the duplicates), and even when corrected it makes virtually the same assumption as my answer. (See my comment on that answer.)
