Best approach to getting count of distinct values in MySQL

Question

I have this query:

select count(distinct User_ID) from Web_Request_Log where Added_Timestamp like '20110312%' and User_ID Is Not Null;

User_ID and Added_Timestamp are indexed.

The query is painfully slow (we have millions of records and the table is growing fast).

I've read all the posts I could find here about COUNT and DISTINCT, but they seem to be mostly syntax-related. I'm interested in optimization, and I'm wondering whether I'm using the right tool for the job.

I can use an intermediate counter table to summarize overall hits, but I'd like a way to do this that lets me easily run ad-hoc 'range' queries, for example the distinct visitor count for last week or last month.

Side note: you don't need `User_ID IS NOT NULL` in the WHERE clause. COUNT(DISTINCT User_ID) by itself counts only non-NULL values.
– a1ex07, Commented Apr 7, 2011 at 16:08
What's the data type of Added_Timestamp? Is it a string? If it were DATETIME, you could use `Added_Timestamp BETWEEN '2011-03-12 00:00:00' AND '2011-03-12 23:59:59'`, which would probably be much faster than LIKE.
– Galz, Commented Apr 7, 2011 at 16:11
Yes, EXPLAIN looks like this: id = 1; select_type = SIMPLE; table = Web_Request_Log; type = range; possible_keys = Web_Request_Log_User_ID, Web_Request_Log_Added_Timestamp; key = Web_Request_Log_Added_Timestamp; key_len = 18; ref = NULL; rows = 255578; Extra = Using where
– Lee Hinde, Commented Apr 9, 2011 at 4:19
Added_Timestamp is a string, which is a legacy issue; I've added a mirror date field as part of the migration to make this a bit more efficient.
– Lee Hinde, Commented Apr 9, 2011 at 4:21
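
As a rough sketch of what the comments above point toward: once a real date column exists, the LIKE prefix match can become a half-open range, which also covers the "last week" or "last month" cases. The column name `Added_Date` and the index name `idx_added_date_user` are hypothetical (the thread does not name the mirror field), and whether the optimizer actually answers the query from the composite index alone depends on the MySQL version and the data.

```sql
-- Assumption: Added_Date is a hypothetical name for the mirror DATE/DATETIME column.
-- Half-open range: covers all of 2011-03-12 without relying on a 23:59:59 upper bound.
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Date >= '2011-03-12'
  AND Added_Date <  '2011-03-13';

-- Same shape for a one-week range.
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Date >= '2011-03-06'
  AND Added_Date <  '2011-03-13';

-- Hypothetical composite index: the range column first, then User_ID,
-- so both columns the query needs are available from the index.
ALTER TABLE Web_Request_Log
  ADD INDEX idx_added_date_user (Added_Date, User_ID);
```

The half-open form (`>=` the start, `<` the day after the end) avoids missing rows with fractional seconds or a time of exactly midnight on the boundary.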

Galz · Accepted Answer · 2011-04-07 16:46:48Z

4

I ran some tests to see whether GROUP BY can help, and it seems it can.

On table A with ~8M records and ~340K distinct records for a given non-indexed field:

GROUP BY           17 seconds
COUNT(DISTINCT ..) 21 seconds

On table A with ~2M records and ~50K distinct records for a given indexed field:

GROUP BY           200 ms
COUNT(DISTINCT ..) 2.5 seconds

This is MySQL with the InnoDB engine, BTW.

I can't find any relevant documentation, though, and I wonder whether that comparison depends on the data (how many duplicates there are).

For your table, the GROUP BY query will look like this:

SELECT COUNT(t.c)
FROM (SELECT 1 AS c
      FROM Web_Request_Log
      WHERE Added_Timestamp LIKE '20110312%'
      AND User_ID IS NOT NULL
      GROUP BY User_ID
      ) AS t

Try it and let us know if it's quicker :)
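
One way to see whether the two forms are planned differently on a given table is to run EXPLAIN on each and compare the `key`, `rows`, and `Extra` columns. This is only a sketch against the table and column names from the question; note that on older MySQL versions EXPLAIN may materialize the derived table, so the second statement can take about as long as the query itself.

```sql
-- Plan for the original COUNT(DISTINCT ...) form.
EXPLAIN
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Timestamp LIKE '20110312%';

-- Plan for the GROUP BY form wrapped in a derived table.
EXPLAIN
SELECT COUNT(*)
FROM (SELECT User_ID
      FROM Web_Request_Log
      WHERE Added_Timestamp LIKE '20110312%'
        AND User_ID IS NOT NULL   -- needed here: GROUP BY would otherwise count NULL as a group
      GROUP BY User_ID
      ) AS t;
```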

edited Apr 7, 2011 at 16:46

answered Apr 7, 2011 at 16:40

Galz


2 Comments

Lee Hinde Over a year ago

Thank you; I'll try it and let you know.

Lee Hinde Over a year ago

So, I tried it and the request timed out: Error Code: 2006 MySQL server has gone away. Also, I'm running on RDS, and the timing you're getting (21 seconds on a table with 8 million records) is much, much better than what I'm getting.
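
If the live query keeps timing out, the intermediate-table idea from the question can be adapted so that it still supports ad-hoc ranges: instead of storing a per-day count, store one row per (day, user) pair and count distinct users over whatever range is needed. The sketch below is an assumption-laden illustration, not something tested in this thread; the table name `Daily_Visitor`, its columns, and the INT type for User_ID are all hypothetical.

```sql
-- Hypothetical rollup table: one row per (day, user) that made at least one request.
CREATE TABLE Daily_Visitor (
  Visit_Date DATE NOT NULL,
  User_ID    INT  NOT NULL,   -- assumption: use the real type of Web_Request_Log.User_ID
  PRIMARY KEY (Visit_Date, User_ID)
);

-- Load one day from the log; INSERT IGNORE skips pairs that are already present.
INSERT IGNORE INTO Daily_Visitor (Visit_Date, User_ID)
SELECT DISTINCT '2011-03-12', User_ID
FROM Web_Request_Log
WHERE Added_Timestamp LIKE '20110312%'
  AND User_ID IS NOT NULL;

-- Distinct visitors over an arbitrary range, read from the much smaller rollup.
SELECT COUNT(DISTINCT User_ID)
FROM Daily_Visitor
WHERE Visit_Date >= '2011-03-06'
  AND Visit_Date <  '2011-03-13';
```

Per-day counts cannot simply be summed across days because the same user may appear on several days; keeping the (day, user) pairs is what makes the range count correct.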
