MySQL Multi Duplicate Record Merging

Question

A previous DBA managed a non-relational table with 2.4M entries, all with unique IDs. However, there are duplicate records with different data in each record, for example:

+---------+---------+--------------+---------+------------+-------------+
| id      | Name    | Address      | Phone   | Email      | LastVisited |
+---------+---------+--------------+---------+------------+-------------+
| 1       | bob     | 12 Some Road | 02456   |            |             | 
| 2       | bobby   |              | 02456   | bob@domain |             |
| 3       | bob     | 12 Some Rd   | 02456   |            | 2010-07-13  | 
| 4       | sir bob |              | 02456   |            |             |
| 5       | bob     | 12SomeRoad   | 02456   |            |             |
| 6       | mr bob  |              | 02456   |            |             |
| 7       | robert  |              | 02456   |            |             |
+---------+---------+--------------+---------+------------+-------------+

This isn't the exact table (the real table has 32 columns); it is just an illustration.

I know how to identify the duplicates; in this case I'm using the phone number. I've extracted the duplicates into a separate table, and there are 730k entries in total.
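For illustration, the extraction step described above could look roughly like the following. This is only a sketch: `myTable` and `phone_dupes` are hypothetical names, and `Phone` is the column from the sample table.

```sql
-- Sketch only: myTable and phone_dupes are assumed names, not the real schema.
-- Copies every row whose Phone value occurs on more than one row.
CREATE TABLE phone_dupes AS
SELECT t.*
FROM myTable AS t
JOIN (
    -- Phone numbers that appear at least twice
    SELECT Phone
    FROM myTable
    GROUP BY Phone
    HAVING COUNT(*) > 1
) AS d ON d.Phone = t.Phone;
```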

What would be the most efficient way of merging these records (and flagging the unneeded records for deletion)?

I've looked at using UPDATE with INNER JOINs, but several WHERE clauses are needed, because I want to update the first record with data from subsequent records wherever a subsequent record has data that the first record lacks.

I've looked at third-party software such as Fuzzy Dups, but I'd like a pure MySQL option if possible.

The end goal is that I'd be left with something like:

+---------+---------+--------------+---------+------------+-------------+
| id      | Name    | Address      | Phone   | Email      | LastVisited |
+---------+---------+--------------+---------+------------+-------------+
| 1       | bob     | 12 Some Road | 02456   | bob@domain | 2010-07-13  | 
+---------+---------+--------------+---------+------------+-------------+

Should I be looking at looping in a stored procedure or function, or is there some really easy thing I've missed?

You will have to write a procedure. But I'd like to know more about what you want: you want to group by phone, but if a group contains different names, addresses, etc., which value would you prefer?
– Sashi Kant, Commented Oct 26, 2011 at 5:34
I guess there's got to be some stable point to work from, so let's just say the first record should be the correct record. This may not be the case for all records, but I'm willing to risk it!
– Rucia, Commented Oct 26, 2011 at 6:31

Sashi Kant · Accepted Answer · 2011-10-26 06:23:10Z

1

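You have to create a PROCEDURE, but before that, create your own temp_table, like this:

INSERT INTO temp_table (column1, column2, ...) SELECT column1, column2, ... FROM myTable GROUP BY phoneNumber;

You have to create this as a physical table so that you can run a cursor over it. Note that with ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7), every selected column other than phoneNumber must be wrapped in an aggregate such as MIN().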

CREATE PROCEDURE myPROC, with the following steps:

Create a cursor on temp_table.
For each row, fetch the phoneNumber and id of the current row from temp_table into local variables (L_id, L_phoneNum).

Here, too, you need a new table, similar_tempTable, which will hold the rows of the current group:

INSERT INTO similar_tempTable (column1, column2, ...) SELECT column1, column2, ... FROM myTable WHERE phoneNumber = L_phoneNum;

The next step is to extract the value you want for each column from similar_tempTable, update the row of myTable where id = L_id, and delete the remaining duplicate rows from myTable.

One more thing: truncate similar_tempTable after every iteration of the cursor.
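The steps above can be sketched as a MySQL procedure. This is a rough illustration, not a tested implementation: the column names (Address, Email, LastVisited) come from the sample table in the question, the variable types and the index name are assumptions, and the rule "first record wins, gaps are filled from later records" follows the asker's comment. It empties the working table with DELETE instead of TRUNCATE, because TRUNCATE TABLE forces an implicit commit on every iteration.

```sql
-- Assumed schema: myTable(id, Name, Address, phoneNumber, Email, LastVisited),
-- id is the primary key, text columns are '' or NULL when missing,
-- and LastVisited is a nullable DATE. Try this on a copy first: it deletes rows.

-- Index suggested in the comments, so each per-phone lookup avoids a full scan.
CREATE INDEX idx_myTable_phone ON myTable (phoneNumber);

-- One row per duplicated phone number; MIN(id) makes the first record the keeper.
-- Blank phone numbers are skipped so unrelated rows are not merged together.
CREATE TABLE temp_table AS
SELECT MIN(id) AS id, phoneNumber
FROM myTable
WHERE phoneNumber IS NOT NULL AND phoneNumber <> ''
GROUP BY phoneNumber
HAVING COUNT(*) > 1;

-- Empty working table with the same columns and indexes as myTable.
CREATE TABLE similar_tempTable LIKE myTable;

DELIMITER //

CREATE PROCEDURE myPROC()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE L_id BIGINT;
    DECLARE L_phoneNum VARCHAR(32);  -- assumed type of phoneNumber
    DECLARE cur CURSOR FOR SELECT id, phoneNumber FROM temp_table;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    OPEN cur;
    merge_loop: LOOP
        FETCH cur INTO L_id, L_phoneNum;
        IF done THEN
            LEAVE merge_loop;
        END IF;

        -- Copy every row of the current phone group into the working table.
        INSERT INTO similar_tempTable
        SELECT * FROM myTable WHERE phoneNumber = L_phoneNum;

        -- Fill each gap on the keeper with the first non-empty value in the group.
        -- Repeat the same pattern for the remaining columns of the real table.
        UPDATE myTable AS k
        SET k.Address = COALESCE(NULLIF(k.Address, ''),
                (SELECT s.Address FROM similar_tempTable AS s
                  WHERE s.Address <> '' ORDER BY s.id LIMIT 1), k.Address),
            k.Email = COALESCE(NULLIF(k.Email, ''),
                (SELECT s.Email FROM similar_tempTable AS s
                  WHERE s.Email <> '' ORDER BY s.id LIMIT 1), k.Email),
            k.LastVisited = COALESCE(k.LastVisited,
                (SELECT s.LastVisited FROM similar_tempTable AS s
                  WHERE s.LastVisited IS NOT NULL ORDER BY s.id LIMIT 1))
        WHERE k.id = L_id;

        -- Remove the other rows of the group, then empty the working table.
        DELETE FROM myTable WHERE phoneNumber = L_phoneNum AND id <> L_id;
        DELETE FROM similar_tempTable WHERE phoneNumber = L_phoneNum;
    END LOOP;
    CLOSE cur;
END //

DELIMITER ;

CALL myPROC();
```

Reading the group from similar_tempTable rather than from myTable also sidesteps MySQL's restriction on selecting, in a subquery, from the table being updated. To flag rows instead of deleting them, as the question asks, the DELETE on myTable could be swapped for an UPDATE that sets a flag column (which would have to be added to the table first).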

Hope this helps.


4 Comments

Rucia Over a year ago

So if I understand you correctly, I run the cursor over the temp table and create another temp table, which then gets truncated. If so, is creating tables on the fly going to be slow? Or have I misunderstood?

Sashi Kant Over a year ago

You got that perfectly. It won't be that slow; just put an index on the phone number, and consider that this will be a one-time-run proc, so you only have to worry once.

Rucia Over a year ago

Only have to worry once - true enough. Thanks for your help :)

Sashi Kant Over a year ago

@Rucia: Don't say thanks, it's just learning and sharing; because of your question, I also learned something... :-)
