Finding duplicates in MySQL

Question

SELECT COUNT(organization.ID)
FROM organization
WHERE organization.NAME IN (
    SELECT organization.NAME
    FROM organization
    WHERE organization.NAME <> ''
        AND organization.APPROVED = 0 
        AND organization.CREATED_AT > '2012-07-31 04:31:08'
    GROUP BY organization.NAME
    HAVING COUNT(organization.ID) > 1
)

This query finds duplicates. The problem is that the page takes 6 seconds to load because of the inner statement. Is there a way to make it run faster? The database is MySQL 5.1.

Isn't the inner statement useless? SELECT COUNT(organization.ID) FROM organization WHERE organization.NAME <> '' AND organization.APPROVED = 0 AND organization.CREATED_AT > '2012-07-31 04:31:08' GROUP BY organization.NAME HAVING COUNT(organization.ID) > 1
– SativaNL, Commented Aug 31, 2012 at 20:47
No. Mine, for instance, returns 67 duplicates; your query breaks that down into 55, 10, and 2, which add up to 67.
– Marin, Commented Aug 31, 2012 at 20:53
@SativaNL: the OP query gets a count of all organizations that have a duplicate name, but ONLY for those organization names that have two (or more) rows matching the specified predicates on APPROVED and CREATED_AT. The OP query will include additional rows in the total count.
– spencer7593, Commented Aug 31, 2012 at 21:38
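
To make the point in spencer7593's comment concrete, here is a minimal sketch that runs both statements against a tiny table. It uses Python's built-in sqlite3 module as a stand-in for MySQL 5.1, and the rows are invented for illustration, so it only shows how the results differ and says nothing about timing.

```python
import sqlite3

# Hypothetical sample rows: (ID, NAME, APPROVED, CREATED_AT).
ROWS = [
    (1, "Acme", 0, "2012-08-01 10:00:00"),
    (2, "Acme", 0, "2012-08-02 10:00:00"),
    (3, "Acme", 1, "2012-01-01 10:00:00"),  # same name, but fails both filters
    (4, "Birch", 0, "2012-08-03 10:00:00"),
    (5, "Birch", 1, "2012-08-04 10:00:00"),  # only one Birch row passes the filters
    (6, "Cedar", 0, "2012-08-05 10:00:00"),
]

# The inner statement on its own: one count per duplicated name.
INNER_ONLY = """
    SELECT COUNT(ID)
    FROM organization
    WHERE NAME <> ''
        AND APPROVED = 0
        AND CREATED_AT > '2012-07-31 04:31:08'
    GROUP BY NAME
    HAVING COUNT(ID) > 1
"""

# The full query from the question: counts every row whose name was flagged.
OUTER_WITH_IN = """
    SELECT COUNT(ID)
    FROM organization
    WHERE NAME IN (
        SELECT NAME
        FROM organization
        WHERE NAME <> ''
            AND APPROVED = 0
            AND CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY NAME
        HAVING COUNT(ID) > 1
    )
"""


def compare_counts():
    """Return (per-name counts from the inner statement, total from the full query)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organization "
        "(ID INTEGER PRIMARY KEY, NAME TEXT, APPROVED INTEGER, CREATED_AT TEXT)"
    )
    conn.executemany("INSERT INTO organization VALUES (?, ?, ?, ?)", ROWS)
    per_name = [row[0] for row in conn.execute(INNER_ONLY)]
    total = conn.execute(OUTER_WITH_IN).fetchone()[0]
    conn.close()
    return per_name, total
```

With these rows the inner statement alone yields [2] (the two filtered Acme rows), while the full query yields 3, because it also counts the approved Acme row that the filters excluded.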

Gordon Linoff · Accepted Answer · 2012-08-31 20:53:38Z

1

Yes. This is slow because MySQL, particularly older versions such as 5.1, handles IN subqueries poorly. You can fix it by using a correlated EXISTS instead:

SELECT COUNT(o.ID)
FROM organization o
WHERE EXISTS (
    SELECT o2.NAME
    FROM organization o2
    WHERE o2.NAME <> ''
        AND o2.APPROVED = 0
        AND o2.CREATED_AT > '2012-07-31 04:31:08'
        AND o2.NAME = o.NAME
    GROUP BY o2.NAME
    HAVING COUNT(o2.ID) > 1
)
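
As a sanity check on the rewrite, the sketch below runs the original IN form and a correlated EXISTS form side by side and returns both counts. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL and uses invented rows, so it can confirm that the two forms agree on this data but not which one is faster on MySQL 5.1.

```python
import sqlite3

# Hypothetical sample rows: (ID, NAME, APPROVED, CREATED_AT).
ROWS = [
    (1, "Acme", 0, "2012-08-01 10:00:00"),
    (2, "Acme", 0, "2012-08-02 10:00:00"),
    (3, "Acme", 1, "2012-01-01 10:00:00"),
    (4, "Birch", 0, "2012-08-03 10:00:00"),
    (5, "", 0, "2012-08-04 10:00:00"),  # empty names are excluded by NAME <> ''
    (6, "", 0, "2012-08-05 10:00:00"),
]

IN_FORM = """
    SELECT COUNT(ID)
    FROM organization
    WHERE NAME IN (
        SELECT NAME
        FROM organization
        WHERE NAME <> ''
            AND APPROVED = 0
            AND CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY NAME
        HAVING COUNT(ID) > 1
    )
"""

# Correlated form: the subquery only looks at rows sharing the outer row's name.
EXISTS_FORM = """
    SELECT COUNT(o.ID)
    FROM organization o
    WHERE EXISTS (
        SELECT o2.NAME
        FROM organization o2
        WHERE o2.NAME <> ''
            AND o2.APPROVED = 0
            AND o2.CREATED_AT > '2012-07-31 04:31:08'
            AND o2.NAME = o.NAME
        GROUP BY o2.NAME
        HAVING COUNT(o2.ID) > 1
    )
"""


def count_with_in_and_exists():
    """Return (count from the IN form, count from the EXISTS form)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organization "
        "(ID INTEGER PRIMARY KEY, NAME TEXT, APPROVED INTEGER, CREATED_AT TEXT)"
    )
    conn.executemany("INSERT INTO organization VALUES (?, ?, ?, ?)", ROWS)
    in_count = conn.execute(IN_FORM).fetchone()[0]
    exists_count = conn.execute(EXISTS_FORM).fetchone()[0]
    conn.close()
    return in_count, exists_count
```

On this sample both forms count the three Acme rows. Every column inside the subquery is qualified with o2 and the correlation compares o2.NAME with o.NAME; once a table is aliased, MySQL no longer accepts the bare table name as a qualifier.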


edze · Accepted Answer · 2012-08-31 20:52:29Z

0

Try to avoid IN. Join against the grouped subquery as a derived table instead:

SELECT COUNT(organization.ID)
FROM 
    organization
    INNER JOIN 
    (
        SELECT organization.NAME
        FROM organization
        WHERE organization.NAME <> ''
            AND organization.APPROVED = 0 
            AND organization.CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY organization.NAME
        HAVING COUNT(organization.ID) > 1
    ) AS t ON organization.NAME = t.NAME
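
The join against a derived table can be checked the same way. This is a minimal sketch, again assuming SQLite via Python's sqlite3 as a stand-in for MySQL and using made-up rows. Because the derived table is grouped by NAME, each name appears in it at most once, so the join cannot multiply rows.

```python
import sqlite3

# Hypothetical sample rows: (ID, NAME, APPROVED, CREATED_AT).
ROWS = [
    (1, "Acme", 0, "2012-08-01 10:00:00"),
    (2, "Acme", 0, "2012-08-02 10:00:00"),
    (3, "Acme", 1, "2012-01-01 10:00:00"),
    (4, "Birch", 0, "2012-08-03 10:00:00"),
    (5, "Cedar", 0, "2012-08-04 10:00:00"),
]

IN_FORM = """
    SELECT COUNT(ID)
    FROM organization
    WHERE NAME IN (
        SELECT NAME
        FROM organization
        WHERE NAME <> ''
            AND APPROVED = 0
            AND CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY NAME
        HAVING COUNT(ID) > 1
    )
"""

# The grouped subquery is evaluated once as a derived table, then joined by name.
JOIN_FORM = """
    SELECT COUNT(organization.ID)
    FROM organization
    INNER JOIN (
        SELECT NAME
        FROM organization
        WHERE NAME <> ''
            AND APPROVED = 0
            AND CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY NAME
        HAVING COUNT(ID) > 1
    ) AS t ON organization.NAME = t.NAME
"""


def count_with_in_and_join():
    """Return (count from the IN form, count from the JOIN form)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organization "
        "(ID INTEGER PRIMARY KEY, NAME TEXT, APPROVED INTEGER, CREATED_AT TEXT)"
    )
    conn.executemany("INSERT INTO organization VALUES (?, ?, ?, ?)", ROWS)
    in_count = conn.execute(IN_FORM).fetchone()[0]
    join_count = conn.execute(JOIN_FORM).fetchone()[0]
    conn.close()
    return in_count, join_count
```

Both counts come out as 3 on this sample, which matches the IN form.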


Marin, over a year ago:

This one is pretty fast. I will test it again later on, thanks :)

jtheman · Accepted Answer · 2012-08-31 20:58:01Z

0

I also find that creating indexes on the columns used in the query vastly improves the speed of complex queries.


edze, over a year ago:

I think he already has indexes. The problem is the IN: it will execute the subquery for each row.

spencer7593 · Accepted Answer · 2012-08-31 21:40:47Z

If what you want to return is a total "count" of all duplicates, but only for those organization NAMEs that have two or more rows matching the specified predicates on APPROVED and CREATED_AT, then you could use an alternate statement that returns an equivalent result:

SELECT SUM(c.cnt) 
  FROM ( SELECT COUNT(o.ID) AS cnt
           FROM organization o
          WHERE o.NAME <> ''
          GROUP
             BY o.NAME
         HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
       ) c

MySQL can make use of a suitable covering index to satisfy this query; otherwise, this is likely a full scan of the organization table. Either way, it avoids referencing the organization table twice and avoids a JOIN operation.

One suitable covering index for this query would be:

ON organization (NAME, CREATED_AT, APPROVED, ID)

Note that if the ID column is guaranteed to be non-NULL (either through a NOT NULL constraint or because it is the PRIMARY KEY of the table), you can avoid referencing that column and leave it out of the index definition:

SELECT SUM(c.cnt) 
  FROM ( SELECT SUM(1) AS cnt
           FROM organization o
          WHERE o.NAME <> ''
          GROUP
             BY o.NAME
         HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
       ) c

The EXPLAIN output shows this query using the index to satisfy the query without referencing any data blocks from the table:

id  select_type  table       type    possible_keys    key              key_len  ref       rows  Extra                     
--  -----------  ----------  ------  ---------------  ---------------  -------  ------  ------  --------------------------
 1  PRIMARY      <derived2>  ALL     (NULL)           (NULL)           (NULL)   (NULL)       2                            
 2  DERIVED      o           index   organization_ix  organization_ix  44       (NULL)      29  Using where; Using index
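
The single-pass form can be tried out with the sketch below. It assumes SQLite via Python's sqlite3 module rather than MySQL, uses invented rows, borrows the index name organization_ix from the EXPLAIN output above, and leaves ID out of the index as the note suggests. SQLite's plan output has a different format from MySQL's EXPLAIN, so the plan is returned for inspection only.

```python
import sqlite3

# Hypothetical sample rows: (ID, NAME, APPROVED, CREATED_AT).
ROWS = [
    (1, "Acme", 0, "2012-08-01 10:00:00"),
    (2, "Acme", 0, "2012-08-02 10:00:00"),
    (3, "Acme", 1, "2012-01-01 10:00:00"),
    (4, "Birch", 0, "2012-08-03 10:00:00"),
    (5, "Birch", 1, "2012-08-04 10:00:00"),
]

# Conditional aggregation: the comparison evaluates to 1 or 0 per row, so the
# SUM in HAVING counts the rows of each name that pass both filters.
SINGLE_PASS = """
    SELECT SUM(c.cnt)
    FROM ( SELECT SUM(1) AS cnt
           FROM organization o
           WHERE o.NAME <> ''
           GROUP BY o.NAME
           HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
         ) c
"""


def single_pass_total():
    """Return (total count of duplicates, SQLite plan details for the query)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organization "
        "(ID INTEGER PRIMARY KEY, NAME TEXT, APPROVED INTEGER, CREATED_AT TEXT)"
    )
    # Index name taken from the EXPLAIN output; column list is an assumption.
    conn.execute(
        "CREATE INDEX organization_ix ON organization (NAME, CREATED_AT, APPROVED)"
    )
    conn.executemany("INSERT INTO organization VALUES (?, ?, ?, ?)", ROWS)
    total = conn.execute(SINGLE_PASS).fetchone()[0]
    # The fourth column of EXPLAIN QUERY PLAN holds the human-readable detail.
    plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + SINGLE_PASS)]
    conn.close()
    return total, plan
```

On this sample the total is 3: Acme has two rows that pass the filters, so all three Acme rows are counted, while Birch has only one and is skipped. One difference from the COUNT-based queries is that the outer SUM returns NULL rather than 0 when no name qualifies, so the calling code may need to handle that case.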
