Delete duplicate rows using custom logic

Question

I'm stuck trying to find a way to remove duplicate rows from a MySQL table using some custom logic.

Current data:

id  name    population
1   CityA   1000
2   CityA   50
3   CityA   0
4   CityB   0
5   CityB   0
6   CityC   10

Desired result:

id  name    population
1   CityA   1000
4   CityB   0
6   CityC   10

I tried this query without success (it deleted all the rows for a city when every population is 0, as in the CityB example):

DELETE t 
FROM table AS t, table AS t2
WHERE t.id != t2.id
AND t.population <= t2.population

Could any superhero solve this super problem?

[EDIT] The working solution: http://sqlfiddle.com/#!9/ea3e3/2

If several rows with the same name share the maximum population, do you need to keep all of them or only one?
– Oto Shavadze, Commented Jun 1, 2017 at 20:39
I want to keep just one row in that case (I don't care which one).
– Guillaume Sainthillier, Commented Jun 1, 2017 at 20:48

Barmar · Accepted Answer · 2017-06-01 21:10:22Z

2

You can do a join with a subquery that returns the ID of the row with the highest population for each city.

DELETE t1
FROM YourTable AS t1
JOIN (SELECT name, MAX(id) AS maxid
      FROM YourTable AS t2
      JOIN (SELECT name, MAX(population) AS maxpop
            FROM YourTable
            GROUP BY name) AS t3 
      ON t2.name = t3.name AND t2.population = t3.maxpop
      GROUP BY t2.name) AS t4
ON t1.name = t4.name AND t1.id != t4.maxid

I needed an extra level of subquery nesting because you have multiple rows with the same population for a name. So it first needs to get the max population for each name, then select a particular ID within that group with MAX(id).
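
To see which rows that inner subquery picks, here is a minimal sketch that loads the sample rows from the question into a scratch table and runs only the "keep" part. The column types are assumptions, since the question does not show the schema, and `YourTable` is the placeholder name used above.

```sql
-- Hypothetical scratch table holding the sample rows from the question.
-- Column types are assumed; the question does not show the real schema.
CREATE TABLE YourTable (
  id         INT         NOT NULL PRIMARY KEY,
  name       VARCHAR(30) NOT NULL,
  population INT         NOT NULL
);

INSERT INTO YourTable (id, name, population) VALUES
  (1, 'CityA', 1000),
  (2, 'CityA', 50),
  (3, 'CityA', 0),
  (4, 'CityB', 0),
  (5, 'CityB', 0),
  (6, 'CityC', 10);

-- Preview of the rows the DELETE would keep: one id per name,
-- chosen among the rows holding that name's maximum population.
SELECT t2.name, MAX(t2.id) AS maxid
FROM YourTable AS t2
JOIN (SELECT name, MAX(population) AS maxpop
      FROM YourTable
      GROUP BY name) AS t3
  ON t2.name = t3.name AND t2.population = t3.maxpop
GROUP BY t2.name;
```

On this sample the preview returns ids 1, 5 and 6. For CityB it keeps id 5 rather than the id 4 shown in the desired result; swapping MAX(id) for MIN(id) would keep id 4 instead.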

edited Jun 1, 2017 at 21:10

answered Jun 1, 2017 at 20:33

Barmar


6 Comments

Guillaume Sainthillier Over a year ago

I just tried your solution, and it deleted all the rows ;p

Barmar Over a year ago

@GuillaumeSTLR spencer7593 is right, I fixed the query.

Barmar Over a year ago

@spencer7593 Correct. Another way to do it is with WHERE id NOT IN (subquery that returns all the maxid).

spencer7593 Over a year ago

@Barmar: or we could use an anti-join. Delete all rows except those that have an id that matches a row in a list of id values we want to keep.

Barmar Over a year ago

Yep, that's the pattern I usually use. But this query is complex enough. :)
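
The NOT IN variant mentioned in the comments above could look roughly like the sketch below. It reuses the `YourTable` placeholder; the `keep_ids` alias is made up for illustration. The extra derived table is there because MySQL rejects a DELETE whose WHERE subquery reads directly from the table being deleted from (error 1093); the GROUP BY inside it should keep it materialized rather than merged into the outer statement.

```sql
-- Sketch only: delete every row whose id is not in the keep-list.
-- keep_ids is a hypothetical alias, not a name from the thread.
DELETE FROM YourTable
WHERE id NOT IN (
  SELECT maxid
  FROM (SELECT MAX(t2.id) AS maxid
        FROM YourTable AS t2
        JOIN (SELECT name, MAX(population) AS maxpop
              FROM YourTable
              GROUP BY name) AS t3
          ON t2.name = t3.name AND t2.population = t3.maxpop
        GROUP BY t2.name) AS keep_ids
);
```

This assumes id is never NULL (for example, a primary key), since NOT IN behaves unexpectedly when the list contains NULL.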


spencer7593 · Answer · 2017-06-01 20:55:19Z

1

Looks like you want to "match" on the city in the name column.

Write a SELECT statement first, and test that, before you convert it into a DELETE statement.

SELECT d.*
  FROM table d
  JOIN table k
    ON k.name        = d.name 
   AND k.population  > d.population 
   AND k.id         <> d.id

We want to keep the rows from k, and delete the rows from d.

Convert that into a DELETE statement by replacing the SELECT keyword with DELETE.
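
A sketch of that conversion, assuming the table really is called `table`: because TABLE is a reserved word in MySQL, the placeholder has to be quoted with backticks (or replaced with the real table name). The `k.id <> d.id` condition is left out here because the strict `>` comparison already rules out a row matching itself.

```sql
-- Sketch: remove every row that has a same-named row with a larger population.
-- `table` is the placeholder name from the SELECT above, quoted because it is reserved.
DELETE d
  FROM `table` d
  JOIN `table` k
    ON k.name       = d.name
   AND k.population > d.population;
```

On the sample data from the question this would remove ids 2 and 3 and leave both CityB rows (ids 4 and 5) in place, which is the limitation described next.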

Note that if there are multiple rows with the same "highest" population for a city, this query won't identify those. To get rid of the "duplicates" with the same population value, we need a slightly different approach.

I'd use an anti-join:

SELECT d.*
  FROM table d
  LEFT
  JOIN ( SELECT MIN(r.id) AS min_id
           FROM ( SELECT t.name
                       , MAX(t.population) AS max_pop
                    FROM table t
                   GROUP BY t.name
                ) s
           JOIN table r
             ON r.name       = s.name
            AND r.population = s.max_pop
          GROUP BY r.name
       ) q
    ON q.min_id = d.id
 WHERE q.min_id IS NULL

Inline view q should return a list of id values, from the rows we want to keep. Any row that has an id that isn't in that list is one we want to remove.

If MySQL balks at the table references in the inline view, we can wrap that in yet another inline view as a workaround.

SELECT d.*
  FROM table d
  LEFT
  JOIN ( SELECT q.min_id
           FROM ( SELECT MIN(r.id) AS min_id
                    FROM ( SELECT t.name
                                , MAX(t.population) AS max_pop
                             FROM table t
                            GROUP BY t.name
                         ) s
                    JOIN table r
                      ON r.name       = s.name
                     AND r.population = s.max_pop
                   GROUP BY r.name
                ) q
       ) p
    ON p.min_id = d.id
 WHERE p.min_id IS NULL

Convert that to a DELETE statement by replacing the outermost SELECT keyword with the DELETE keyword.
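
As a sketch, the first anti-join converted that way might read as follows. The `table` placeholder is quoted with backticks here because TABLE is a reserved word in MySQL; substitute the real table name.

```sql
-- Sketch: delete every row whose id is not in the keep-list q.
DELETE d
  FROM `table` d
  LEFT
  JOIN ( SELECT MIN(r.id) AS min_id
           FROM ( SELECT t.name
                       , MAX(t.population) AS max_pop
                    FROM `table` t
                   GROUP BY t.name
                ) s
           JOIN `table` r
             ON r.name       = s.name
            AND r.population = s.max_pop
          GROUP BY r.name
       ) q
    ON q.min_id = d.id
 WHERE q.min_id IS NULL;
```

On the sample data from the question, q holds ids 1, 4 and 6, so the statement would remove ids 2, 3 and 5, matching the desired result.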

edited Jun 1, 2017 at 20:55

answered Jun 1, 2017 at 20:38

spencer7593


2 Comments

Barmar Over a year ago

See my answer for the "different approach".

Guillaume Sainthillier Over a year ago

@spencer7593: Thank you very much for your help. However, I focused on Barmar's solution and it just worked ;-)

symcbean · Answer · 2017-06-01 20:39:27Z

0

CREATE TABLE new_table (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(30),
  population INT
);

INSERT INTO new_table (name, population)
SELECT old.name, MAX(old.population)
FROM current_table old
GROUP BY old.name;

RENAME TABLE current_table TO archive_table
, new_table TO current_table;

Then, once you've checked the data:

DROP TABLE archive_table;
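
One possible way to check the data before running that DROP is to compare the number of distinct names in the archived table with the number of rows in the rebuilt one. This is only a sketch; the `names_before` and `rows_after` aliases are made up for illustration.

```sql
-- Sketch of a sanity check: the rebuilt table should hold one row per name.
SELECT
  (SELECT COUNT(DISTINCT name) FROM archive_table) AS names_before,
  (SELECT COUNT(*)             FROM current_table) AS rows_after;
```

The two numbers should match as long as name is never NULL; COUNT(DISTINCT ...) ignores NULL, while the GROUP BY in the INSERT above gives NULL names a group of their own.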

answered Jun 1, 2017 at 20:39

symcbean


3 Comments

Barmar Over a year ago

If there are other tables with foreign keys pointing to this table, the IDs will change as a result. Although that's probably a problem with the DELETE methods as well, since those references will become invalid when the related rows are deleted.

symcbean Over a year ago

And if Guillaume's got millions of rows in his database, then an in-place delete has complications around locking. But I suspect that neither applies.

Guillaume Sainthillier Over a year ago

It's the geonames database, so possibly millions of rows, yes.
