1

I have the following table:

tbl
source    type    date
---       ---     ---
google    A       2010-02-25
google    A       2013-04-11
facebook  C       2008-10-22
facebook  C       2007-01-28

I want to keep only a single entry of each source, and the criteria is select the source tuple with min(date) group by source. The table consists of millions of records, and I'm looking for an efficient way to delete redundant records.

2
  • Does this table have any unique identifier (i.e. primary key, id field etc.)? Commented Oct 9, 2017 at 13:40
  • @ChrisJ nah, it doesn't. Commented Oct 9, 2017 at 13:43

4 Answers 4

3

In MySQL, you can do this using a join:

delete t
    from t join
         (select source, min(date) as mindate
          from t
          group by source
         ) tt
         on t.source = tt.source
    where t.date > tt.mindate;

The only way -- off-hand -- that I can think to make this more efficient is to store the aggregation result in a subquery and add an index to it.

I can also add that regardless of the computation for determining the rows to delete, deleting lots of rows in a table in inefficient. Usually, I would recommend a three-step approach:

  1. Write a query to generate the table you want and store the results in a temporary table.
  2. Truncate the original table.
  3. Re-insert the (much) smaller number of rows.
Sign up to request clarification or add additional context in comments.

Comments

0

In Microsoft SQL, you can try this.

;
WITH cte 
        AS (SELECT ROW_NUMBER() OVER (PARTITION BY source, type
                                        ORDER BY createdate) RN
            FROM   tbsource)
DELETE FROM cte
WHERE  RN > 1;

7 Comments

@GordonLinoff, I'm sorry. The question doesn't contain it is for MySQL.
There is a mysql tag, but probably should be in the title too if that's what it should be.
@mikato, Yeah correct! Even SQL tag is there which makes this question to see as a generic one.
And what is Microsoft SQL? Likely you meant Microsoft SQL Server or Microsoft TSQL. This is not splitting hairs since MS Access has a SQL dialect of its own distinct from MSSQL and both are MS products. And Microsoft Query spans several interfaces.
@Parfait, You're correct. I meant Microsoft SQL Server or Transact SQL here.
|
0
delete from t where date not in (select al.d from (select min(date) as d  from t group by source )al);

1 Comment

Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you've made.
0

Add an identity column to the duplicate table as a serial number that acts as a row unique identifier(auto incremental ascending order):

 alter table tbl add sno int identity(1,1)

table

This query selects only non duplicated rows with min(date):

(select min(date),sno From tbl group by source) 

So "sno" will be equals to "1" and "4".

Now join with this table, and delete the records of the join that are duplicated (t.sno is null)

delete E from tbl E
    left join
    (select min(date),sno From tbl group by source) T on E.sno=T.sno
where T.sno is null

table3

Solution adapted from method 3 of this link: LINK

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.