Slow Query - Help with Optimization

Question

Hey guys. This is a follow-on from this question:

After getting the right data and making some tweaks based on requests from the business, I've now got this mini-beast on my hands. This query should return the total number of new jobseeker registrations and the number of newly uploaded CVs:

SELECT COUNT(j.jobseeker_id) as new_registrations,
(
    SELECT 
      COUNT(c.cv_id)
    FROM 
      tb_cv as c, tb_jobseeker, tb_industry
    WHERE
      UNIX_TIMESTAMP(c.created_at) >= '1241125200'
    AND 
      UNIX_TIMESTAMP(c.created_at) <= '1243717200'
    AND 
      tb_jobseeker.industry_id = tb_industry.industry_id
) 
AS uploaded_cvs
FROM 
  tb_jobseeker as j, tb_industry as i
WHERE
  j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND
  i.industry_id = j.industry_id
GROUP BY i.description, MONTH(j.created_at)

Note: The two values in the UNIX_TIMESTAMP() comparisons are passed in as parameters from the report module in our backend.

Every time I run it, MySQL chokes and lingers silently into the ether of the Interweb.

Help is appreciated.

Update: Hey guys. Thanks a lot for all the thoughtful and helpful comments. I'm only 2 weeks into my role here, so I'm still learning the schema. So, this query is somewhere between a thumbsuck and an educated guess. Will start to answer all your questions now.

You'll have to provide some information about the tables involved in this query... which columns have indexes, etc.? ... Also, could you format the query a little friendlier to the eyes?
– jerryjvl, Commented Jun 5, 2009 at 7:37
What are you trying to do? You will need to give us the schema and the indexes you're using if you want help on optimization.
– Nicolas Dumazet, Commented Jun 5, 2009 at 7:37
What's this cv_id? A full table? Also, in the subquery, tb_cv is not joined/linked to tb_jobseeker and tb_industry. Are you sure you want to do this?
– Nicolas Dumazet, Commented Jun 5, 2009 at 7:42
@Midiane: I suspect the query should return the number of jobseeker registrations and CVs created per industry per month? Is that correct?
– Tomalak, Commented Jun 5, 2009 at 9:09
@Tomalak yes, you're right. Sorry guys, got called into two long meetings.
– Midiane, Commented Jun 5, 2009 at 10:10

Tomalak · Accepted Answer · 2009-06-05 11:08:25Z

6

tb_cv is not connected to the other tables in the sub-query. I guess this is the root cause for the slow query. It causes generation of a Cartesian product, yielding a lot more rows than you probably need.
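A rough way to see the size of that product, sketched with only the table and column names from the question: count each side separately and multiply. The figures depend entirely on your data, so treat this as a diagnostic rather than a fix.

```sql
-- Diagnostic sketch: how many row combinations does the original
-- sub-query visit? tb_cv has no join condition there, so every CV in the
-- date range is paired with every jobseeker/industry match.
SELECT
  cv.cv_rows,
  ji.jobseeker_industry_rows,
  cv.cv_rows * ji.jobseeker_industry_rows AS combinations_visited
FROM
  (
    SELECT COUNT(*) AS cv_rows
    FROM tb_cv
    WHERE UNIX_TIMESTAMP(created_at) BETWEEN 1241125200 AND 1243717200
  ) AS cv
  CROSS JOIN
  (
    SELECT COUNT(*) AS jobseeker_industry_rows
    FROM tb_jobseeker AS j
    INNER JOIN tb_industry AS i ON i.industry_id = j.industry_id
  ) AS ji;
```

With, say, 1,000 CVs and 50,000 jobseekers (made-up numbers), that is 50 million combinations visited to produce a single count.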

Other than that I'd say you need indexes on tb_jobseeker.created_at, tb_cv.created_at and tb_industry.industry_id, and you might want to get rid of the UNIX_TIMESTAMP() calls in the sub-query since they prevent use of an index. Use BETWEEN and the actual field values instead.
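A minimal sketch of those two changes, assuming created_at is a DATE, DATETIME or TIMESTAMP column on both tables. The index names are made up for illustration. tb_industry.industry_id is probably the primary key already, in which case it needs no extra index.

```sql
-- Index names are hypothetical; pick whatever fits your naming scheme.
CREATE INDEX idx_jobseeker_created_at ON tb_jobseeker (created_at);
CREATE INDEX idx_cv_created_at        ON tb_cv (created_at);

-- Compare the bare column against literals so the optimizer is free to
-- use the index on created_at, then check the plan with EXPLAIN.
EXPLAIN
SELECT COUNT(cv_id)
FROM tb_cv
WHERE created_at BETWEEN '2009-05-01' AND '2009-05-31';
```

In the EXPLAIN output, the key column should name the new index if the optimizer decides to use it. NULL there suggests a full table scan.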

Here is my attempt at understanding your query and writing a better version. I guess you want to get the count of new jobseeker registrations and new uploaded CVs per month per industry:

SELECT 
  i.industry_id,
  i.description, 
  MONTH(j.created_at)            AS month_created,
  YEAR(j.created_at)             AS year_created,
  COUNT(DISTINCT j.jobseeker_id) AS new_registrations,
  COUNT(cv.cv_id)                AS uploaded_cvs
FROM 
  tb_cv AS cv
  INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = cv.jobseeker_id
  INNER JOIN tb_industry  AS i ON i.industry_id  = j.industry_id
WHERE
  j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
  AND cv.created_at BETWEEN '2009-05-01' AND '2009-05-31'
GROUP BY 
  i.industry_id,
  i.description, 
  MONTH(j.created_at),
  YEAR(j.created_at)

A few things I noticed while writing the query:

- You GROUP BY values you don't output in the end. Why? (I've added the grouped fields to the output list.)
- You JOIN three tables in the sub-query while only ever using values from one of them. Why? I don't see what it would be good for, other than filtering out CV records that don't have a jobseeker or an industry attached, which I find hard to imagine. (I've removed the entire sub-query and used a simple COUNT instead.)
- Your sub-query returns the same value every time. Did you mean to correlate it in some way, perhaps to the industry?
- The sub-query runs once for every record in a grouped query without being wrapped in an aggregate function.
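If a correlated sub-query is what was intended, it could look roughly like the sketch below. This is one possible reading, not the query from the question: it assumes tb_cv.jobseeker_id links a CV to its jobseeker (as the rewritten query above does) and counts every CV uploaded in May per industry, regardless of when the jobseeker registered.

```sql
SELECT
  i.industry_id,
  i.description,
  COUNT(j.jobseeker_id) AS new_registrations,
  (
    -- Correlated on i.industry_id, so the count differs per industry.
    SELECT COUNT(cv.cv_id)
    FROM tb_cv AS cv
    INNER JOIN tb_jobseeker AS j2 ON j2.jobseeker_id = cv.jobseeker_id
    WHERE j2.industry_id = i.industry_id
      AND cv.created_at >= '2009-05-01'
      AND cv.created_at <  '2009-06-01'
  ) AS uploaded_cvs
FROM
  tb_industry AS i
  INNER JOIN tb_jobseeker AS j ON j.industry_id = i.industry_id
WHERE
  -- Half-open range: if created_at is a DATETIME, BETWEEN ... AND
  -- '2009-05-31' would stop at midnight and miss most of May 31.
  j.created_at >= '2009-05-01'
  AND j.created_at < '2009-06-01'
GROUP BY
  i.industry_id,
  i.description;
```

Industries with no new registrations in the period drop out of this result because of the inner join. Whether that is acceptable depends on the report.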

edited Jun 5, 2009 at 11:08

answered Jun 5, 2009 at 7:40

Tomalak


3 Comments

Steve Weet Over a year ago

+1 Indeed, there is a Cartesian product on tb_cv and tb_jobseeker

Midiane Over a year ago

I must admit that I'm not strong on SQL and I used parts of other queries used in the system to get to this... sheepish

Midiane Over a year ago

hey tomalak, i just tried your query. it works perfectly! thanks. and it's much easier to read and not as complex as my poor attempt. i stripped out a few fields I don't need. Thanks a lot, seriously.

jerryjvl · Answer · 2009-06-05 07:40:15Z

0

First and foremost, it may be worth moving the UNIX_TIMESTAMP() conversions to the other side of the comparison (that is, apply the reverse function to the literal timestamp values on the other side of the >= and <=). That way the conversion is performed once for the query rather than once for every record in the inner query.
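In MySQL the reverse function is FROM_UNIXTIME(). A minimal sketch using the two parameter values from the question:

```sql
-- Convert the two report parameters once, instead of converting
-- created_at for every row. The bare column on the left-hand side also
-- leaves the optimizer free to use an index on created_at, if one exists.
SELECT COUNT(c.cv_id)
FROM tb_cv AS c
WHERE c.created_at >= FROM_UNIXTIME(1241125200)
  AND c.created_at <= FROM_UNIXTIME(1243717200);
```

FROM_UNIXTIME() returns its result in the session time zone, the same zone UNIX_TIMESTAMP() uses to interpret a DATETIME argument, so the two forms should select the same rows.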

Also, why does the uploaded_cvs query not have any where clause linking it to the outer query? Am I missing something here?

answered Jun 5, 2009 at 7:40

jerryjvl
