I have this query:

select count(distinct User_ID) from Web_Request_Log where Added_Timestamp like '20110312%' and User_ID Is Not Null;

User_ID and Added_Timestamp are indexed.

The query is painfully slow (we have millions of records and the table is growing fast).

I've read all the posts I could find here about COUNT and DISTINCT, but they seem to be mostly syntax-related. I'm interested in optimization, and I'm wondering whether I'm using the right tool for the job.

I can use an intermediate counter table to summarize overall hits, but I'd like a way to do this that would also let me easily run ad-hoc 'range' queries, for example the distinct visitor count for last week or last month.
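One possible shape for a summary table that still supports ad-hoc ranges is to store one row per user per day rather than a per-day total, since per-day distinct counts cannot simply be added across days (the same user may appear on several days). This is only a minimal sketch: the table name Daily_Visitor, its column names, and the INT type for User_ID are assumptions, not part of the actual schema.

```sql
-- Hypothetical summary table: one row per (day, user) pair.
CREATE TABLE Daily_Visitor (
    Visit_Date DATE NOT NULL,
    User_ID    INT  NOT NULL,   -- assumed type; match the real User_ID column
    PRIMARY KEY (Visit_Date, User_ID)
);

-- Load one day's visitors. INSERT IGNORE skips pairs that already exist,
-- so the load can be re-run for the same day.
INSERT IGNORE INTO Daily_Visitor (Visit_Date, User_ID)
SELECT DISTINCT '2011-03-12', User_ID
FROM Web_Request_Log
WHERE Added_Timestamp LIKE '20110312%'
  AND User_ID IS NOT NULL;

-- Distinct visitors over an arbitrary range (here: one week).
SELECT COUNT(DISTINCT User_ID)
FROM Daily_Visitor
WHERE Visit_Date >= '2011-03-06'
  AND Visit_Date <  '2011-03-13';
```

The range query then scans at most one row per user per day instead of every logged request.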

5 Comments

  • What are your indexes? Have you tried an EXPLAIN on it? Commented Apr 7, 2011 at 16:07
  • Side note: you don't need User_ID IS NOT NULL in the WHERE clause. COUNT(DISTINCT ...) only counts non-NULL values anyway. Commented Apr 7, 2011 at 16:08
  • What's the data type of Added_Timestamp? Is it a string? If it were a DATETIME, you could use `Added_Timestamp BETWEEN '2011-03-12 00:00:00' AND '2011-03-12 23:59:59'`, which would probably be much faster than LIKE. Commented Apr 7, 2011 at 16:11
  • Yes, EXPLAIN looks like this: 1 SIMPLE Web_Request_Log range Web_Request_Log_User_ID,Web_Request_Log_Added_Timestamp Web_Request_Log_Added_Timestamp 18 NULL 255578 Using where Commented Apr 9, 2011 at 4:19
  • Added_Timestamp is a string, which is a legacy issue; I've added a mirror date field as part of the migration to make this a bit more efficient. Commented Apr 9, 2011 at 4:21
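
Applied to the mirror date field, the range predicate suggested in the comments might look like the sketch below. The column name Added_Date is an assumption (the comments do not name the new field), as is its DATETIME type. A half-open range is used instead of BETWEEN so that no part of the day's final second is left out.

```sql
-- Added_Date is a hypothetical name for the mirror DATETIME column.
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Date >= '2011-03-12 00:00:00'
  AND Added_Date <  '2011-03-13 00:00:00';
```

The same pattern extends to a week or a month by moving the two boundary values.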

1 Answer

I ran some tests to see whether GROUP BY can help, and it seems it can.

On table A with ~8M records and ~340K distinct records for a given non-indexed field:

GROUP BY           17 seconds
COUNT(DISTINCT ..) 21 seconds

On table A with ~2M records and ~50K distinct records for a given indexed field:

GROUP BY           200 ms
COUNT(DISTINCT ..) 2.5 seconds

This is MySQL with the InnoDB engine, by the way.

I can't find any relevant documentation, though, and I wonder whether the comparison depends on the data (that is, on how many duplicates there are).
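
For reference, the two forms being compared have roughly the following shape. The names A and f are placeholders, not the actual test schema. Note the IS NOT NULL filter in the second form: GROUP BY produces a group for NULL, whereas COUNT(DISTINCT ...) ignores NULL values, so without the filter the two counts can differ by one.

```sql
-- Form 1: COUNT(DISTINCT ...), which skips NULLs on its own.
SELECT COUNT(DISTINCT f)
FROM A;

-- Form 2: collapse duplicates with GROUP BY in a derived table,
-- then count the resulting groups.
SELECT COUNT(*)
FROM (SELECT f
      FROM A
      WHERE f IS NOT NULL   -- keep the NULL group out of the count
      GROUP BY f
     ) AS t;
```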

For your table, the GROUP BY query will look like this:

SELECT COUNT(t.c)
FROM (SELECT 1 AS c
      FROM Web_Request_Log
      WHERE Added_Timestamp LIKE '20110312%'
      AND User_ID IS NOT NULL
      GROUP BY User_ID
      ) AS t

Try it and let us know if it's quicker :)

2 Comments

Thank you; I'll try it and let you know.
So, I tried it and the request timed out: Error Code: 2006 MySQL server has gone away. Also, I'm running on RDS, and the timing you're getting (21 seconds on a table with 8 million records) is much, much better than what I'm getting.
