0

I'm stuck trying to find a way to remove some duplicate rows from a MySQL database using some custom logic.

Actual data:

id  name    population
1   CityA   1000
2   CityA   50
3   CityA   0
4   CityB   0
5   CityB   0
6   CityC   10

Desired result:

id  name    population
1   CityA   1000
4   CityB   0
6   CityC   10

I tried this query without success (it deleted every row for a city when all of that city's populations are equal to 0, as in the CityB example):

DELETE t 
FROM table AS t, table AS t2
WHERE t.id != t2.id
AND t.population <= t2.population

Could any superhero solve this super problem?

[EDIT] The working solution: http://sqlfiddle.com/#!9/ea3e3/2

2
  • If several rows share the same name and the same max population, do you need to keep them all or only one? Commented Jun 1, 2017 at 20:39
  • I want to keep just one row in this case (I don't care which one). Commented Jun 1, 2017 at 20:48

3 Answers

2

You can do a join with a subquery that returns the ID of the row with the highest population for each city.

DELETE t1
FROM YourTable AS t1
JOIN (SELECT t2.name, MAX(t2.id) AS maxid
      FROM YourTable AS t2
      JOIN (SELECT name, MAX(population) AS maxpop
            FROM YourTable
            GROUP BY name) AS t3 
      ON t2.name = t3.name AND t2.population = t3.maxpop
      GROUP BY t2.name) AS t4
ON t1.name = t4.name AND t1.id != t4.maxid

I needed an extra level of subquery nesting because you have multiple rows with the same population for a name. So it first needs to get the max population for each name, then select a particular ID within that group with MAX(id).
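Before running the DELETE, it can help to preview which rows it would remove. The following is a minimal sketch, assuming the table is named YourTable as in the query above and is loaded with the sample data from the question; the unqualified columns are written as t2.name and t2.id here so the column references are unambiguous.

```sql
-- Sample data from the question, loaded into the answer's placeholder table.
CREATE TABLE YourTable (
  id INT PRIMARY KEY,
  name VARCHAR(30),
  population INT
);

INSERT INTO YourTable (id, name, population) VALUES
  (1, 'CityA', 1000), (2, 'CityA', 50), (3, 'CityA', 0),
  (4, 'CityB', 0),    (5, 'CityB', 0),  (6, 'CityC', 10);

-- Same joins as the DELETE, but only listing the rows it would remove.
SELECT t1.*
FROM YourTable AS t1
JOIN (SELECT t2.name, MAX(t2.id) AS maxid
      FROM YourTable AS t2
      JOIN (SELECT name, MAX(population) AS maxpop
            FROM YourTable
            GROUP BY name) AS t3
      ON t2.name = t3.name AND t2.population = t3.maxpop
      GROUP BY t2.name) AS t4
ON t1.name = t4.name AND t1.id != t4.maxid
ORDER BY t1.id;
```

With this data the preview should list ids 2, 3 and 4, so the DELETE keeps ids 1, 5 and 6. For CityB it keeps id 5 rather than id 4 because of MAX(id); using MIN(id) instead would keep id 4.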


6 Comments

I just tried your solution, and it deleted all the rows ;p
@GuillaumeSTLR spencer7593 is right, I fixed the query.
@spencer7593 Correct. Another way to do it is with WHERE id NOT IN (subquery that returns all the maxid).
@Barmar: or we could use an anti-join. Delete all rows except those that have an id that matches a row in a list of id values we want to keep.
Yep, that's the pattern I usually use. But this query is complex enough. :)
1

Looks like you want to "match" on the city in the name column.

Write a SELECT statement first, and test that, before you convert it into a DELETE statement.

SELECT d.*
  FROM table d
  JOIN table k
    ON k.name        = d.name 
   AND k.population  > d.population 
   AND k.id         <> d.id

We want to keep the rows from k, and delete the rows from d.

Convert that into a DELETE statement by replacing the SELECT keyword with DELETE.

Note that if there are multiple rows with the same "highest" population for a city, this query won't identify those. To get rid of the "duplicates" with the same population value, we need a slightly different approach.

I'd use an anti-join:

SELECT d.*
  FROM table d
  LEFT
  JOIN ( SELECT MIN(r.id) AS min_id
           FROM ( SELECT t.name
                       , MAX(t.population) AS max_pop
                    FROM table t
                   GROUP BY t.name
                ) s
           JOIN table r
             ON r.name       = s.name
            AND r.population = s.max_pop
          GROUP BY r.name
       ) q
    ON q.min_id = d.id
 WHERE q.min_id IS NULL

Inline view q should return a list of id values, from the rows we want to keep. Any row that has an id that isn't in that list is one we want to remove.

If MySQL balks at the table references in the inline view, we can wrap that in yet another inline view as a workaround.

SELECT d.*
  FROM table d
  LEFT
  JOIN ( SELECT q.min_id
           FROM ( SELECT MIN(r.id) AS min_id
                    FROM ( SELECT t.name
                                , MAX(t.population) AS max_pop
                             FROM table t
                            GROUP BY t.name
                         ) s
                    JOIN table r
                      ON r.name       = s.name
                     AND r.population = s.max_pop
                   GROUP BY r.name
                ) q
       ) p
    ON p.min_id = d.id
 WHERE p.min_id IS NULL

Convert that to a DELETE statement by replacing the outermost SELECT keyword with the DELETE keyword.
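As a sketch of what the converted statement might look like: the table name cities below is hypothetical (TABLE is a reserved word in MySQL, so the placeholder name table used above would need backticks), and the statement uses the DELETE d form of MySQL's multi-table delete.

```sql
-- Anti-join delete: remove every row whose id is NOT in the keep list.
-- "cities" is a hypothetical table name with columns id, name, population.
DELETE d
  FROM cities d
  LEFT
  JOIN ( SELECT q.min_id
           FROM ( SELECT MIN(r.id) AS min_id       -- one id to keep per name
                    FROM ( SELECT t.name
                                , MAX(t.population) AS max_pop
                             FROM cities t
                            GROUP BY t.name
                         ) s
                    JOIN cities r
                      ON r.name       = s.name
                     AND r.population = s.max_pop
                   GROUP BY r.name
                ) q
       ) p
    ON p.min_id = d.id
 WHERE p.min_id IS NULL;   -- no match in the keep list, so delete
```

With the sample data from the question, the keep list should contain ids 1, 4 and 6, which matches the desired result exactly because MIN(id) picks the lowest id among the tied CityB rows.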

2 Comments

See my answer for the "different approach".
@spencer7593: Thank you very much for your help. However, I focused on Barmar's solution and it just worked ;-)
0
CREATE TABLE new_table (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(30),
  population INT
);

INSERT INTO new_table (name, population)
SELECT old.name, MAX(old.population)
FROM current_table old
GROUP BY old.name;

RENAME TABLE current_table TO archive_table
, new_table TO current_table;

Then, once you've checked the data:

DROP TABLE archive_table;
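One way to check the data before dropping the archive is sketched below. It assumes the table names used above (current_table is the new deduplicated table after the rename, archive_table is the original) and that name is never NULL, since COUNT(DISTINCT name) ignores NULLs while GROUP BY puts them in a group of their own.

```sql
-- The two counts should be equal: one kept row per distinct name.
SELECT (SELECT COUNT(DISTINCT name) FROM archive_table) AS distinct_names,
       (SELECT COUNT(*)             FROM current_table) AS kept_rows;

-- Should return no rows: each kept population is the archive's maximum for that name.
SELECT c.name, c.population, a.max_pop
FROM current_table c
JOIN (SELECT name, MAX(population) AS max_pop
      FROM archive_table
      GROUP BY name) a
  ON a.name = c.name
WHERE c.population <> a.max_pop;
```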

3 Comments

If there are other tables with foreign keys pointing to this table, the IDs will change as a result, although that's probably a problem with the DELETE methods as well, since the references become invalid when the related rows are deleted.
And if Guillaume's got millions of rows in his database, then an in-place delete has complications around locking. But I suspect that neither applies.
It's the GeoNames database, so possibly millions of rows, yes.
